


Long Atomic Computations 

by 

Pui Ng

August 1986 



© Massachusetts Institute of Technology 1986 

This research was supported by the Advanced Research Projects Agency 
of the Department of Defense, monitored by the Office of Naval Research 
under contract number N00014-83-K-0125.



Massachusetts Institute of Technology 
Laboratory for Computer Science 
Cambridge, Massachusetts 02139 



Long Atomic Computations 



by 

Pui Ng 

Submitted to the 

Department of Electrical Engineering and Computer Science 

on August 26, 1986 in partial fulfillment of the requirements 

for the Degree of Doctor of Philosophy 

Abstract 

Distributed computing systems are becoming commonplace and offer interesting
opportunities for new applications. In a practical system, the problems of
synchronizing concurrent computations and recovering from failures must be dealt
with effectively. Atomicity has been suggested as a tool that masks concurrency and
failures from the users of a system. With synchronization and recovery mechanisms,
atomic computations appear to execute indivisibly. This dissertation addresses the
issues in implementing long atomic computations, such as computations that last for
hours or even days. Long computations make synchronization more difficult
because their execution is more overlapped. They are also more likely to encounter
failures in their execution.

Three issues are raised: 

1. Should long computations be executed atomically? Or should atomicity
be replaced with other correctness criteria to increase the concurrency
of a system?

2. If long atomic computations can be implemented practically, are there 
implementation paradigms that application programmers can follow to 
simplify the programming effort? 

3. How can long atomic computations be made resilient to transient
failures?

This dissertation shows that by using the semantics of an application, a system that 
supports atomic computations can be made as concurrent as other systems that do 
not. Since atomicity is easier to understand than other correctness criteria, systems
that support long atomic computations are preferable.

Using the semantics of an application requires application-dependent
synchronization and recovery code, which can be complicated and introduce subtle
errors easily. Several synchronization and recovery paradigms are investigated in
this dissertation. The paradigms divide synchronization and recovery into levels so
that the task at each level is simpler. A programming interface that hides the
concurrency control algorithm used by a system implementation is also presented.

Finally, this dissertation discusses the use of checkpoints and buffered messages to 
increase the resilience of long atomic computations. 



Thesis Supervisor: David D. Clark 
Title: Senior Research Scientist 

Keywords: Distributed Systems, Atomicity, Concurrency Control, Long 
Computations, Recovery, Fault Tolerance, Reliability, Programming Methodology. 



Acknowledgments 



First I would like to thank my advisor, Dave Clark, for his guidance. He always 
provided me with fresh insights and imparted his enthusiasm. In addition, I would like
to thank my readers, Dave Gifford and Bill Weihl, for their patience and suggestions 
that make the thesis more readable. Bill also offered many detailed comments, which 
forced me to think through many issues more thoroughly. 

Numerous people on the fifth floor provided both technical and emotional support. In
particular, Jim Gibson, Brian Oki, and Lixia Zhang read drafts of my thesis and gave
useful suggestions. There are many others that showed their friendship and concern
through their words of encouragement. I thank them all.

Many brothers and sisters in my church had in a sense written this thesis together 
with me. I cannot say enough to thank their support in prayer and in fellowship. 
Their support started with my looking for a thesis topic and kept me going.

I thank Elaine for her encouragement, her love, and her patience with me. 

Finally, my family members, especially my parents, had shown their unceasing love 
and faith throughout the many years spent in writing this thesis. I would like to 
express my deepest appreciation. 

May all the glory be to God. 



Table of Contents 



Chapter One: Introduction 11
1.1 Long Atomic Computations 13
1.2 Concurrency and Resilience Problems 14
1.2.1 Concurrency Problem 14
1.2.2 Resilience Problem 18
1.3 Contributions and Solutions 19
1.3.1 Functionality - Concurrency Trade-Off 21
1.3.2 Implementation Paradigms 22
1.3.2.1 Level Atomicity 24
1.3.2.2 Conflict Model 26
1.3.2.3 Programming Interface 28
1.3.2.4 Concurrency Control Algorithms 29
1.3.3 Resilience Problem and Its Solutions 30
1.4 Roadmap 30
1.5 Related Work 31
1.5.1 Predicate Locks 31
1.5.2 Schwarz's Thesis 31
1.5.3 Allchin's Thesis 32
1.5.4 Weihl's Thesis 33
1.5.5 Garcia-Molina's Semantic Consistency 34
1.5.6 Montgomery's Thesis 34
1.5.7 Gifford's Persistent Actions 34
1.5.8 Sha's Thesis 35
1.5.9 Miscellaneous 36

Chapter Two: System Model 39
2.1 Physical Environment and Assumptions 39
2.2 Model of Computation 40
2.3 Atomicity 43
2.3.1 Event Model 44
2.3.2 State Machines 46
2.3.3 Atomic Histories 48

Chapter Three: Using Application Semantics 51
3.1 Conflict Model 53
3.1.1 Generating Atomic Histories 53
3.1.2 Guaranteeing Equivalence to Serial Histories 54
3.1.3 Generating Atomic Behavior 55
3.1.4 Generating Valid Results 56
3.1.5 Conflicts 57
3.1.6 Conclusion 58
3.2 An Example 59
3.2.1 Read_Balance Operations 59
3.2.2 Withdraw Operations 60
3.3 Deriving Conflict Conditions 63
3.4 Increasing Concurrency 64
3.4.1 Reducing Precision of Numerical Results 65
3.4.2 Conditional Operations 67
3.4.3 Discussion 69
3.5 Summary 70

Chapter Four: Implementing Atomic Objects 71
4.1 Overview of Implementation Paradigms 73
4.1.1 Lower-Level Synchronization and Recovery 73
4.1.2 Higher-Level Synchronization 75
4.1.3 Higher-Level Recovery 76
4.2 Global Atomicity and Local Atomicity 77
4.2.1 Definitions of Global Atomicity and Local Atomicity 77
4.2.2 Implementing Locally Atomic Computations 78
4.2.3 Related Work 80
4.3 Synchronization 80
4.3.1 History Objects 81
4.3.1.1 Masking Concurrency Control Algorithms 82
4.3.1.2 Advantages and Disadvantages of Transparency 84
4.3.2 Resolving Conflicts 85
4.4 Recovery 87
4.4.1 Intentions List Paradigm 88
4.4.2 Undo Log Paradigm 90
4.5 Programming Interface
4.5.1 History Objects Continued 94
4.5.2 Transition Objects 95
4.5.3 Template Objects 96
4.5.4 Resource Managers 97
4.6 Program Examples 99
4.7 Implementation Trade-Offs 106
4.7.1 Comparison of Recovery Paradigms 106
4.7.2 Implementing Atomic Objects with Atomic Objects 107
4.7.2.1 Two Approaches to Implement a Bank Object 112
4.7.2.2 Comparison of the Two Approaches 117
4.7.3 Partitioning and Replicating History Objects 121
4.8 Conclusion 126

Chapter Five: Concurrency Control Algorithms 128
5.1 Concurrency Control Algorithms 130
5.1.1 Static Concurrency Control Algorithms 130
5.1.2 Dynamic Concurrency Control Algorithms 133
5.2 Improving Concurrency with Concurrency Control Algorithms 135
5.2.1 Hierarchical Concurrency Control Algorithm 136
5.2.2 Time-Range Concurrency Control Algorithm 138
5.3 Making Concurrency Control Algorithms Transparent 146
5.3.1 Implementation of History Operations 147
5.3.2 Implementation of Retry Statement 148
5.4 Commit Protocols 155
5.4.1 Two-Phase Commit Protocol 155
5.4.2 One-Phase Commit Protocol 157
5.5 Summary 159

Chapter Six: Power of Atomicity 160
6.1 Informal Proof of Power of Atomicity 164
6.2 Formal Proof of Power of Atomicity 167
6.2.1 Atomicity 167
6.2.2 Consistency 168
6.2.3 Proof 169
6.3 Objects with Simple Serial Specifications 176
6.3.1 Accurate Objects 177
6.3.2 Specifications of Accurate Objects Can Be Raised 179
6.3.3 There Are Many Accurate Objects 181
6.4 Conclusion 186

Chapter Seven: Resilience 188
7.1 Checkpoints 190
7.1.1 Checkpoint Time 191
7.1.1.1 Checkpointing a Program 192
7.1.1.2 Propagating a Checkpoint to Previously Invoked Sub-Programs 194
7.1.1.3 Two Kinds of Checkpoints 195
7.1.1.4 Propagating a Checkpoint to Ancestor Programs 196
7.1.1.5 Checkpointing Parallel Sub-Actions 199
7.1.2 Restart Time 199
7.1.2.1 Identifying the Restartable Program 199
7.1.2.2 Restarting a Program 201
7.1.2.3 Other Types of Failures 202
7.2 Message Transfer Agents 202
7.3 Conclusion 205

Chapter Eight: Conclusion 207
8.1 Summary 207
8.2 Future Work 210
8.2.1 Other Communication Primitives 210
8.2.2 Hardware Configuration and Reliability 211
8.2.3 Replication 212
8.2.4 Implementation Experience 213
8.3 Conclusion 214



Table of Figures 



Figure 1-1: A Globally Atomic Computation Implemented with Locally Atomic Computations 25
Figure 2-1: States of a Computation/Action/Operation 42
Figure 2-2: A State Machine for a Set 47
Figure 3-1: A State Machine for a Bank Account Object 54
Figure 3-2: A History and a Transition Sequence 55
Figure 4-1: Interface of a History Object 82
Figure 4-2: Interface of a Transition Object 95
Figure 4-3: An Implementation of a Set RM with the Intention List Paradigm 100
Figure 4-4: An Implementation of a Bank Account Object with the Undo Log Paradigm 103
Figure 4-5: An Implementation of a Bank Account Object with the Intention List Paradigm 108
Figure 4-6: A State Machine for a Bank Object 111
Figure 4-7: An Implementation of a Bank Object with the Intention List Paradigm 113
Figure 4-8: A Simple Implementation of a Bank Object 116
Figure 4-9: A Simple Implementation of a Bank Account Object 117
Figure 4-10: Two Different Approaches of Implementing a Globally Atomic Bank Object 118
Figure 4-11: A Specialized State Machine for a Bank Account Object 119
Figure 4-12: A State Machine for a Semi-Queue 123
Figure 4-13: An Implementation of a Semi-Queue Object 124
Figure 5-1: Implementations for Sub and Prior 148
Figure 6-1: Specification of a Bank Account Object in a Consistent System 178
Figure 7-1: Partitions that Prevent Communication 189
Figure 7-2: A Program Using Checkpoints 194
Figure 7-3: Propagating Checkpoints to Ancestor Programs 198
Figure 7-4: Using Parallel Sub-Actions to Specify Application Time-Out 200
Figure 7-5: Rollbacks due to Deadlocks or Invalid Dependency Assumptions 203




Chapter One 
Introduction 



Distributed systems have become a reality with the increasing employment of
workstations, home computers, and different types of computer communications
equipment. Distributed computing has offered many opportunities to build new types
of applications. These applications are characterized by activities that span multiple
sites of a distributed system. For example, a travel agent may make several
reservations in different airline, hotel, and car rental reservation systems. A bank
customer may withdraw money from his account over a geographically distributed
banking network. An employee in an office may schedule a meeting with several of
his colleagues using a calendar system that runs on multiple workstations and
portable computers.

However, as the number of sites connected in a distributed system grows, it also
becomes increasingly likely that some components of the system are broken at any
given time. Furthermore, the job of synchronizing concurrent activities becomes
more difficult. It is unrealistic to use any centralized scheduler when many users may
be initiating activities at the same time.

Atomicity [17, 28] has been suggested as a useful tool that alleviates these
synchronization and reliability problems. Under the atomicity model, the activities in
a distributed system are modelled as a collection of atomic computations. A
computation is a unit of work initiated by a user or by the system itself. Atomic
computations are computations that appear to execute serially in a certain
serialization order. This serializability property frees the programmer from worrying
about concurrent computations interleaving with one another. In addition to the
serial behavior, an atomic computation is either committed or aborted. The effects of
a committed computation become visible to all computations executed subsequently.
The effects are also permanent so that they are not lost with transient failures such
as power outages. When a computation is aborted, any work performed by the
computation is undone and the computation appears never to have executed. This
all-or-nothing property is called failure atomicity. It lessens the burden on application
programmers by undoing computations that are partially done.

In this dissertation we consider how long atomic computations can be supported in a
distributed system. As the size of a distributed system becomes larger, it is inevitable
that the lengths of computations also increase. With a large system, it is unrealistic
to expect every component to be highly reliable given the high cost of such
components. As a result, communication delays, network partitions, and
unavailability of critical resources due to site crashes are just some of the reasons
why computations may execute for a long time. In fact, long computations can be
created simply because there is much work to be done as a single unit, or because a
computation requires human interaction. Consequently, long computations are not
limited to distributed systems.

The increase in computation lengths exacerbates the synchronization and reliability
problems. As each computation executes for a longer period of time, there is more
overlapping of execution, which increases the likelihood of some of the computations
being delayed. It also becomes more likely to encounter a failure during the
execution of a long computation. Current distributed systems supporting atomic
computations [31, 56] do not provide adequate support to long atomic computations.
These systems do not provide any facilities for a computation to make its
intermediate state resilient to transient failures. Also, because of an implicit model of
short computations, it is considered acceptable to delay one computation pending
the completion of another. In a system with long computations, such delays are
usually unacceptable.

The rest of this chapter is divided in the following way. Section 1.1 describes our
definition of long computation more carefully and gives examples of such
computations. Section 1.2 discusses the major problems in supporting long atomic
computations. Section 1.3 summarizes our solutions and contributions toward
solving these problems. Section 1.4 presents a roadmap for the thesis. Section
1.5 gives an overview of related work.

1.1 Long Atomic Computations 

A computation may execute for a long time because of extensive computing or
waiting for I/O events (e.g., waiting for input from keyboard or network). For
example, a computation that requires human interaction can last for minutes or even
hours. Clearly, the length of a computation is a relative measure. Instead of using an
absolute numerical definition for long computations, we concentrate on
computations that may require special support due to their length. Whether such
support is needed depends on the length of computations and on the characteristics
of the system on which they are executed. For example, the concurrency control
algorithms, the system usage characteristics, and the mean-time-between-failure
characteristics of the hardware are some of the factors that affect the response time
and resilience of a system. In a typical distributed system that supports atomic
computations [31, 56], computations that last hours or days can be considered long
because they are prone to be aborted and induce long delays in concurrent
computations. Shorter computations that last minutes or even seconds can also be
considered long if the hardware is unreliable or the system is heavily used.

In our discussions we will focus on long computations whose lengths can be 
attributed to long delays in network communication. Several factors can contribute 
to these long delays: 

- mobility of sites, such as disconnection of portable computers,

- unreliable links in the network causing partitions,

- slow links or switches,

- economic reasons: sending messages batched is less expensive,

- security that is enforced by isolation.




We believe that our work is also applicable to other types of long computations 
because of the similarity of the problems encountered in supporting them. 

Many applications require long computations. For example, a computation that
schedules a meeting among several personal calendar servers can last for hours or
days because some of the calendars reside on portable computers and are
disconnected from the system. A replicated database [11] may propagate the
updates to a replicated data object over a long period of time. A computation making
several airline, hotel, and car rental reservations may last too long compared to the
concurrency requirements of an airline reservation system.

1.2 Concurrency and Resilience Problems 

In the previous section we alluded to a concurrency problem and a resilience
problem with long atomic computations. Intuitively, a system is bound to create a
concurrency problem when it is trying to maintain an image of substantially
overlapped computations executing serially. A resilience problem is also to be
expected because it is more likely to encounter a transient failure in the execution of
a long computation than in a short computation. This section describes these
problems more concretely by describing how some systems [31, 48, 56] implement
atomicity. We argue that a long atomic computation causes long delays in
concurrent computations and is prone to be aborted in these implementations.

1.2.1 Concurrency Problem 

In most earlier work [46, 40, 48, 26, 7], a (distributed) system is modelled as a
collection of objects with read/write operations. A computation is modelled as a
sequence of read/write operations on the objects accessed by the computation. In
order to guarantee serializability and failure atomicity of atomic computations, each
object is implemented to behave "atomically:" the values returned by the read
operations should be identical to those returned had the committed computations
been executed in some serial order common to all the objects.




In general, two different types of algorithms are used to ensure such atomic behavior.
In a locking algorithm, an object is associated with a read/write lock [31, 56, 17]. A
read (write) lock is acquired before a read (write) operation is executed. Two locks
conflict with each other unless they are both read locks. When a computation
requests a lock, it is delayed until all other computations that had previously acquired
conflicting locks are completed. This locking algorithm is called 2-phase
locking [17]. In a timestamp algorithm, computations are assigned timestamps when
they are started [48]. A computation is aborted and restarted if it tries to write an
object that had already been read by another computation with a larger timestamp. If
a computation with a timestamp t tries to read an object, it is delayed until the
computation that has the largest, yet smaller than t, timestamp among all the
computations that had written that object is committed.
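
The two rules just described can be sketched in a few lines of Python. This is an illustrative sketch only, not code from any of the cited systems: the names `locks_conflict`, `TimestampedObject`, and `must_abort_write` are hypothetical, and the delay that the timestamp algorithm imposes on reads is left out.

```python
# Illustrative sketch; every name here is hypothetical.


def locks_conflict(held: str, requested: str) -> bool:
    """Two locks conflict unless they are both read locks."""
    return not (held == "read" and requested == "read")


class TimestampedObject:
    """Remembers the largest timestamp of any computation that read the object."""

    def __init__(self) -> None:
        self.max_read_ts = 0

    def read(self, ts: int) -> None:
        # Record that a computation with timestamp ts has read the object.
        self.max_read_ts = max(self.max_read_ts, ts)

    def must_abort_write(self, ts: int) -> bool:
        # A writer is aborted and restarted if a computation with a larger
        # timestamp has already read the object.
        return self.max_read_ts > ts
```

For instance, after a read at timestamp 5, a write attempted at timestamp 3 must be aborted, while a write at timestamp 7 may proceed.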

When long computations are executed, neither type of algorithm results in a
satisfactory level of concurrency. In the locking algorithm, a long computation
causes other computations that attempt to acquire conflicting locks to be delayed
until it is completed. Worse yet, computations can be deadlocked with one another,
so that one of them has to be aborted. When a deadlock occurs, there is the cost of
detection, which usually involves passing messages among sites [43], and the cost of
restarting the computation. Although there is no empirical data on the frequency of
deadlocks in a system with long computations, one can expect deadlocks to be more
frequent than in a system with only short computations, as locks are held for longer
periods of time.

The long delays caused by incomplete computations are also possible in a timestamp
algorithm. In addition, a long computation can be aborted due to other computations
with larger timestamps reading the objects that it is going to write. Normally, to make
sure that computations are serialized approximately in the order that they are
invoked, timestamps are assigned from real-time clocks. Consequently, a
computation becomes more likely to be aborted when it gets longer, because more
computations are started while it is being executed.
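
A rough way to see this dependence on length, under an assumption that is ours and not the thesis's: suppose that computations with larger timestamps read the object at independent random times, at an average of `rate` reads per unit time (a Poisson process). The chance that at least one such read arrives within a window of length `duration` before the write is then 1 - exp(-rate * duration), which approaches 1 as the computation gets longer. The function name and the arrival model are hypothetical.

```python
import math


def abort_probability(rate: float, duration: float) -> float:
    """Chance of at least one conflicting read within `duration`.

    Assumes conflicting reads arrive as a Poisson process with the given
    average rate; this model is an illustration, not part of the thesis.
    """
    return 1.0 - math.exp(-rate * duration)
```

Under this model, doubling the length of a computation never lowers its chance of being aborted, whatever the arrival rate.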




The following example illustrates the concurrency problem. Consider a personal
calendar application that consists of many personal calendars, each owned by a
different user. Each user can read his own calendar (read_calendar), reserve a time
slot in his calendar (mark), and un-reserve a time slot (delete). Read_calendar returns
a list of slots, some of which are reserved by previous mark operations. The mark
operation can return okay or slot_filled depending on whether the proposed slot has
already been reserved. Delete un-reserves a slot and returns okay if the user is
permitted to do so. Otherwise cannot_delete is returned. All of these operations,
except read_calendar, require updating a calendar. On top of these operations, the
calendar application can construct computations that set up a meeting among
several calendars (set_up_meeting), or computations that cancel a meeting
(cancel_meeting). For example, set_up_meeting would invoke a mark operation at
each of the calendars involved.
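
The operations above can be sketched as a small Python class. This is a serial sketch with no concurrency control or failure handling; the operation names and result strings follow the text, while the slot representation, the permission rule in `delete`, and the way `set_up_meeting` undoes its marks are assumptions made for illustration.

```python
from typing import List, Optional


class Calendar:
    """One user's calendar; a slot holds the name of whoever reserved it."""

    def __init__(self, owner: str, num_slots: int) -> None:
        self.owner = owner
        self.slots: List[Optional[str]] = [None] * num_slots

    def read_calendar(self) -> List[bool]:
        # True marks a slot reserved by an earlier mark operation.
        return [holder is not None for holder in self.slots]

    def mark(self, slot: int, user: str) -> str:
        if self.slots[slot] is not None:
            return "slot_filled"
        self.slots[slot] = user
        return "okay"

    def delete(self, slot: int, user: str) -> str:
        # Assumption: only the user who reserved a slot may un-reserve it.
        if self.slots[slot] != user:
            return "cannot_delete"
        self.slots[slot] = None
        return "okay"


def set_up_meeting(calendars: List[Calendar], slot: int, organizer: str) -> bool:
    """Invoke mark at every calendar; undo the marks if any slot is filled."""
    marked: List[Calendar] = []
    for cal in calendars:
        if cal.mark(slot, organizer) != "okay":
            for done in marked:
                done.delete(slot, organizer)
            return False
        marked.append(cal)
    return True
```

In a real system each `mark` invocation would travel across the network, which is where a set_up_meeting computation spends most of its time.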

A word of terminology is needed before we proceed with the example. In this thesis,
a computation is modelled as a sequence of operations on some objects. It should
be emphasized that these objects are abstract objects supporting abstract
operations, such as the calendar object described above. A simplified view of the
system is to regard each operation, such as a mark operation, as relatively short,
while a computation, such as a set_up_meeting computation, spends most of its time
delivering messages across a network to invoke operations.

A computation that involves multiple calendars may span a long period of time
because some of the computers involved may be disconnected from the system
either physically (because they are portable) or functionally (because they are not
running the calendar software). Set_up_meeting and cancel_meeting computations
belong to this category.

Obviously, if each calendar object is implemented with a single read/write lock, the
level of concurrency would be unacceptably low. For example, it is unacceptable to
render all the calendars of a meeting's participants inaccessible until a disconnected
participant is reconnected to complete a set_up_meeting computation. The
timestamp algorithm has similar problems. We will omit it in the discussion below
unless it offers interesting alternatives to the locking algorithm.

Concurrency can be increased by dividing a calendar into slots and associating a
read/write lock with each slot. However, the concurrency of the implementation may
still be unacceptably low. For example, consider the situation in which the owner of a
calendar is trying to read his calendar when the calendar is the participant of an
incomplete set_up_meeting computation. Following the read/write lock algorithm, the
read_calendar operation will be delayed until the set_up_meeting computation is
completed. This is clearly unacceptable.
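
The situation can be made concrete with a small sketch of per-slot read/write locks. The class below is hypothetical and does not block; instead each method reports whether the request could be granted immediately (True) or would have to be delayed (False). It assumes that read_calendar needs a read lock on every slot.

```python
from typing import Dict, Set


class SlotLocks:
    """Per-slot read/write locks for one calendar (illustrative only)."""

    def __init__(self, num_slots: int) -> None:
        self.num_slots = num_slots
        self.writer: Dict[int, str] = {}        # slot -> computation holding the write lock
        self.readers: Dict[int, Set[str]] = {}  # slot -> computations holding read locks

    def try_write(self, slot: int, comp: str) -> bool:
        other_readers = self.readers.get(slot, set()) - {comp}
        if self.writer.get(slot, comp) != comp or other_readers:
            return False  # a conflicting lock is held: the request is delayed
        self.writer[slot] = comp
        return True

    def try_read_all(self, comp: str) -> bool:
        # read_calendar is delayed if any slot is write-locked by another computation.
        for slot in range(self.num_slots):
            if self.writer.get(slot, comp) != comp:
                return False
        for slot in range(self.num_slots):
            self.readers.setdefault(slot, set()).add(comp)
        return True

    def release(self, comp: str) -> None:
        # Called when the computation completes.
        self.writer = {s: c for s, c in self.writer.items() if c != comp}
        for holders in self.readers.values():
            holders.discard(comp)
```

Two computations that mark different slots proceed independently, but a single incomplete writer is enough to delay every read_calendar until that writer completes.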

One may argue that a timestamp algorithm offers a solution in this situation. By
choosing a smaller timestamp for the computation that invokes read_calendar than
that of the set_up_meeting computation, the read_calendar operation can return the
state of the calendar before the set_up_meeting computation is executed. However,
this solution is not without its problems. Suppose the owner of the calendar decides
to reserve the slot occupied by the set_up_meeting computation for some other
purpose. The request cannot be accepted because the slot had already been
promised to the set_up_meeting computation, albeit tentatively.¹ On the other hand,
the request cannot be delayed or rejected either because an inconsistent picture will
be presented: by observing the state of the calendar before the set_up_meeting
computation is executed, the user is led to believe that the slot is empty and expects
the request to readily succeed.
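
The dilemma can be illustrated with a sketch of a slot that keeps one version per timestamp. The class and its method names are hypothetical, timestamps are assumed to be positive integers, and, as a simplification, a read at a timestamp larger than that of a tentative write returns the tentative value instead of being delayed until the writer commits.

```python
from typing import List, Tuple


class VersionedSlot:
    """Keeps every (timestamp, reserved, committed) version written to a slot."""

    def __init__(self) -> None:
        # Timestamp 0 stands for the initial, committed, empty state.
        self.versions: List[Tuple[int, bool, bool]] = [(0, False, True)]

    def write(self, ts: int, reserved: bool, committed: bool = False) -> None:
        self.versions.append((ts, reserved, committed))
        self.versions.sort()

    def read(self, ts: int) -> bool:
        # Return the latest version written with a timestamp smaller than ts.
        older = [v for v in self.versions if v[0] < ts]
        return older[-1][1]

    def has_tentative_write_after(self, ts: int) -> bool:
        # True if an uncommitted computation with a larger timestamp wrote the slot.
        return any(v[0] > ts and not v[2] for v in self.versions)
```

If set_up_meeting tentatively reserves the slot at timestamp 10, an owner reading at timestamp 5 sees an empty slot, yet `has_tentative_write_after(5)` is true, so a mark request at that timestamp cannot simply be accepted.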

One may consider this example as an argument against having long atomic
computations. Arguing intuitively, we cannot expect an implementation to hide the
fact that there are multiple users using the system in substantially overlapped periods
of time. Hence, atomicity may have to be replaced with some other correctness
criterion. One of the contributions of this thesis is to show how atomicity can be
employed even with long computations. Section 1.3 will describe how atomicity can
be used in conjunction with non-determinism to solve the concurrency problem.
Since atomicity is not abandoned, the simplicity offered by atomicity is preserved.

¹Depending on how different sites of a distributed computation decide whether the computation
should be committed or aborted, a site may be able to abort an incomplete computation unilaterally [17].
However, there is also a window of vulnerability in which such unilateral aborts are not allowed. This
window can span a long period of time if communication delays are long. In any case, it is rather
counterproductive to abort any incomplete set_up_meeting computations whenever a calendar is read.

In conclusion, the concurrency problem is caused by the uncertainty of whether an
incomplete computation would eventually commit, and also the requirement that
computations should appear to execute serially when in fact they are invoked
concurrently. The problem is more serious in a system with long computations
because long computations take a long time to complete and overlap substantially.

1.2.2 Resilience Problem

In addition to the concurrency problem, one also needs to deal with a resilience
problem in implementing long atomic computations. For a system with long
computations, the failure atomicity requirement is both a blessing and a curse. On
the one hand, the increased likelihood that a long computation would encounter
some transient failure² heightens our need for recovery mechanisms. Failure
atomicity provides a simple interface to the application users because a computation
is executed either in entirety or not at all. On the other hand, satisfying failure
atomicity requires aborting computations interrupted by transient failures unless
sufficient intermediate state of the computations has been saved. Some systems
preserve the intermediate state of a computation through the use of replicated
processors and memories [3, 13]. However, these systems require a degree of
replication that may be too expensive for many applications.

When a long computation is aborted, potentially much time and work can be wasted.
The decomposition of a computation into a nested tree of actions [40, 48] provides a
partial solution: an action can be aborted without undoing the effects of its sibling
and ancestor actions. It is inadequate since actions near the top of the tree are still
vulnerable. Transient failures that happen while these actions are waiting for their
descendant actions to complete can cause most of the action tree to be aborted. For
example, a set_up_meeting computation can be implemented with a parent action at
the originator of the meeting, which creates a child action at each of the participant
calendars. Although the computation is insulated from transient failures at the
participants, it is still vulnerable to failures at the originator site. We will describe the
nested action model in greater detail in Chapter 2.

²Sources of transient failure include site crashes and deadlocks.
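
As a minimal sketch of why the top of the tree remains vulnerable, consider the following hypothetical `Action` class. It models only the abort rule stated above (aborting an action undoes its own subtree and leaves siblings and ancestors alone) and is not the nested action model of Chapter 2.

```python
from typing import List, Optional


class Action:
    """A node in a nested action tree (illustrative only)."""

    def __init__(self, name: str, parent: Optional["Action"] = None) -> None:
        self.name = name
        self.state = "active"
        self.children: List["Action"] = []
        if parent is not None:
            parent.children.append(self)

    def abort(self) -> None:
        # Aborting an action undoes its whole subtree, but not its
        # siblings or ancestors.
        self.state = "aborted"
        for child in self.children:
            child.abort()
```

Aborting the child action at one participant leaves the parent and the other children active, whereas a failure that aborts the parent action at the originator takes every child with it.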

1.3 Contributions and Solutions

Collectively, our contributions can be viewed as an argument for the feasibility of 
long atomic computations. More specifically, they can be viewed as solutions to the 
concurrency and resilience problems. We will start with an enumeration of our major 
contributions, then we will give a more detailed summary of the solutions presented 
in this thesis. 

There are four major contributions in this thesis: 

1. We show that an application can trade off functionality for more
concurrency. By functionality we refer to the behavior of the application
when computations are executed serially in an environment without
failures. Our approach, like other proposals [1, 25, 38, 50, 51], uses
application semantics to increase concurrency. However, our approach,
similar to [33], goes a step further and raises the possibility of
"decreasing" functionality to increase concurrency. The decrease in
functionality is achieved by introducing non-determinism. Our
contribution is to show that this approach of decreasing functionality
while maintaining atomicity is as "powerful" as other correctness
definitions that have abandoned atomicity, such as the input consistency
criterion described in [50]. We will show that the exact gain in
concurrency through the use of these other correctness definitions can
be realized through decreasing the functionality of the application. On
this basis, we will claim that our atomicity definition is preferable, since in
comparison it is equally powerful and easier to understand.




2. Our second contribution is the development of a conflict model that
allows the programmer to determine an approximation of the level of
concurrency achievable with a particular functionality of an application.
The level of concurrency is expressed as conditions under which
conflicts occur. A conflict is created when an implementation is
uncertain of how computations are serialized or whether a computation
will eventually commit. When conflicts occur, computations are either
delayed or restarted, depending on how the serialization order is
determined. The model is useful in that it abstracts away the details of
how to deal with a conflict and how the serialization order is determined.
For example, the programmer can design the functionality of an
application without worrying about whether a timestamp or locking
algorithm is used.

3. Our third contribution relates to the study of concurrency control
algorithms, which determine the actions that are taken when conflicts
arise and how a serialization order is determined. Although the
concurrency of an application is significantly influenced by its
functionality, we argue that the concurrency control algorithm still has an
effect on the overall level of concurrency of an implementation. For
example, the cost of a conflict is relatively insignificant if it causes a long
computation to be delayed until the completion of a short computation.
The same is not true if the situation is reversed. Our contribution lies in
the design of novel concurrency control algorithms that can substantially
reduce costly conflicts under certain conditions.

4. Finally, this thesis also discusses how applications can be implemented
such that the concurrency of the implementations would improve with
the relaxation of the application functionality. Our contribution is the
design of a programming interface that allows application semantics to
be utilized without exposing the concurrency control algorithm
underneath. Our programming interface allows a programmer to write
programs for systems using different concurrency control algorithms
without having to be familiar with all of the algorithms. The programs are
also portable so that no modifications are necessary when the underlying
concurrency control algorithm is changed.

Having enumerated the major contributions, we now proceed to give a more detailed 
description of the solutions to the concurrency and resilience problems proposed in 
this thesis. 



1.3.1 Functionality - Concurrency Trade-Off 

Consider the read_calendar operation discussed in section 1.2 again. Although we
have described the concurrency problem using the locking and timestamp
algorithms, the problem lies in fact in the functionality of the operation. The problem
exists regardless of how atomicity is implemented. The functionality of the
read_calendar operation that we described in section 1.2 is to present an up-to-date
view of the state of a calendar. In addition, we also require the view to be accurate
such that it reflects only committed computations. This is clearly unachievable given
that a set_up_meeting computation has visited the calendar and the calendar has no
knowledge as to whether the computation will be committed eventually. An
implementation must either risk presenting an inaccurate view or choose an outdated
one.

The solution that we propose in this thesis is not to abandon atomicity, but rather to
change the functionality of the read_calendar operation. For example, one can
incorporate non-determinism in the functionality of the read_calendar operation such
that the set of reserved slots in the list of slots returned is required to be only a
superset of the set of reserved slots in the accurate view. By allowing
non-determinism in the result returned by read_calendar, read_calendar does not have to
be delayed until all incomplete set_up_meeting computations are completed.
Read_calendar can simply return all the slots reserved by incomplete or committed
computations as reserved. The result returned by read_calendar is acceptable even if
some of the incomplete computations turn out to be aborted later. The semantics of
read_calendar does not require the result to contain only committed slots. We will
define atomicity such that it allows a non-deterministic functionality of an application
to be incorporated in the definition. Liskov et al. proposed the same solution in [33].
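
As an illustration only, the relaxed functionality of read_calendar can be sketched in Python. The class, its methods, and the way computations are identified by strings are assumptions made for this sketch; they are not the interface developed in this thesis.

```python
from enum import Enum


class Status(Enum):
    INCOMPLETE = "incomplete"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Calendar:
    """Toy calendar (hypothetical): each reserved slot remembers who reserved it."""

    def __init__(self):
        self.reserved_by = {}   # slot -> computation id
        self.status = {}        # computation id -> Status

    def mark(self, comp, slot):
        # A set_up_meeting computation reserves a slot; it is incomplete for now.
        self.status.setdefault(comp, Status.INCOMPLETE)
        self.reserved_by[slot] = comp

    def finish(self, comp, outcome):
        # Record commit or abort; an abort releases the computation's slots.
        self.status[comp] = outcome
        if outcome is Status.ABORTED:
            self.reserved_by = {s: c for s, c in self.reserved_by.items()
                                if c != comp}

    def accurate_view(self):
        # Slots reserved by committed computations only.
        return {s for s, c in self.reserved_by.items()
                if self.status[c] is Status.COMMITTED}

    def read_calendar(self):
        # Relaxed functionality: slots reserved by incomplete or committed
        # computations, returned without waiting for any outcome.
        return set(self.reserved_by)


cal = Calendar()
cal.mark("m1", "Mon 9am")
cal.finish("m1", Status.COMMITTED)
cal.mark("m2", "Mon 10am")          # m2 is still incomplete
view = cal.read_calendar()          # a superset of cal.accurate_view()
```

In this sketch read_calendar never waits: the slot held by the incomplete computation m2 is reported as reserved, which remains acceptable under the relaxed semantics even if m2 later aborts.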

Our example can also illustrate why atomicity, coupled with the functionality of the
applications, is as powerful as some other correctness definitions. For example,
consider an alternative in which set_up_meeting is implemented as a collection of
atomic computations [15], one at each participant calendar of the meeting. If
set_up_meeting is to be abandoned, compensating atomic computations can be
executed at each of the participants already visited. Concurrency is not a problem in
this implementation because each of the atomic computations is short. Interestingly,
the behavior of this implementation is the same as the one with the relaxed
functionality of read_calendar described above. Because set_up_meeting is
implemented as a collection of atomic computations, the atomic computation that
executes read_calendar can be serialized between the atomic computation that
reserves the slot for the meeting and a later compensating atomic computation. The
result returned by read_calendar is just as up-to-date and tentative as that implied by
the relaxed functionality. The difference is that our approach provides an abstract
specification of the behavior of the implementation, defined by atomicity and the
relaxed functionality of the application. The abstract specification allows the users of
the application to understand the behavior of the implementation more easily.

1.3.2 Implementation Paradigms

Relaxing the functionality of the application by itself is not sufficient to solve the
concurrency problem. For example, if an implementation of the calendar application
uses read/write locks, relaxing the functionality of read_calendar does not change
the fact that a read_calendar operation trying to acquire the read lock would still be
delayed by a set_up_meeting computation that is holding a write lock. In this thesis,
we are also interested in how an application can be implemented such that the
relaxed functionality of an application can be utilized. To provide a summary of our
programming paradigms, we will describe how System R, a relational database
management system that supports atomic computations [18], increases its
concurrency with the semantics of its index objects. We will draw analogies between
System R's approach and our paradigms.

There are two levels of objects in System R. At the upper level, there are RSS
objects, such as an index to a relation. At the lower level, there are page objects.
Invoking an operation on an index object involves accessing one or more page
objects. Accesses to page objects are synchronized with page locks, which can be
viewed as read/write locks of the page objects. Because a page object that
implements an index object may be accessed by many concurrent computations,
locking a page for the entire duration of a computation is unacceptable. To increase
concurrency, page locks are released at the end of an operation on an index object,
instead of at the end of a computation. To preserve atomicity, an additional level of
"logical locking" is implemented. Information about an index operation is recorded
when the operation is executed. By examining the history of past index operations,
"conflicting" index operations that may lead to non-atomic behavior, such as
inserting and reading from the same key value, are delayed. Furthermore, because
the relevant page locks have been released, aborting an index operation cannot be
achieved by restoring the previous contents of the modified pages. Rather, a logical
undo operation is invoked during recovery.
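
The need for a logical undo can be seen in a small Python sketch. It is not System R code: the single-page index, the undo log, and all names are assumptions chosen to keep the example short.

```python
class IndexObject:
    """Toy index stored on a single page (a sorted list of keys)."""

    def __init__(self):
        self.page = []
        self.undo_log = {}   # computation -> logical undo actions

    def insert(self, comp, key):
        # The page lock would be held only while this operation runs.
        self.page.append(key)
        self.page.sort()
        self.undo_log.setdefault(comp, []).append(("delete", key))

    def abort(self, comp):
        # Logical undo: apply inverse operations, newest first.
        for action, key in reversed(self.undo_log.pop(comp, [])):
            if action == "delete":
                self.page.remove(key)


index = IndexObject()
before_image = list(index.page)    # page contents before t1's insert
index.insert("t1", 5)
index.insert("t2", 7)              # t2 uses the same page before t1 ends
index.abort("t1")                  # removes only t1's key
```

Restoring before_image when t1 aborts would also erase the key inserted by t2, which is why the abort is expressed as the inverse index operation instead.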

Our approach to implementing atomicity is similar to System R's in many ways. 
Moreover, we are interested in the following questions: 

1. Can System R's approach of utilizing the semantics of an index object be
applied to other kinds of application-level objects? In particular, can the
programs that perform the "logical" synchronization and recovery be
made easier to write and understand by following a general
implementation paradigm?

2. Can a concurrency control algorithm akin to a timestamp algorithm, or
some other hybrid algorithms [7], substitute for the locking protocol used
in the page lock level or the logical locking level or both? Can a
programming interface be designed such that an application
programmer is not aware of the concurrency control algorithms used in
the system implementation?

The rest of this section gives a summary of our answers to these questions.



1.3.2.1 Level Atomicity 

Similar to System R's approach of implementing atomicity, ours also divides objects
into multiple levels. This division is more than just a division of levels of abstraction.
As will be described in this section, the division is a partitioning of the
synchronization and recovery code of an implementation. For simplicity's sake, we
will limit the discussion in this thesis to systems with only two levels. An object in the
higher level is implemented using the objects in the lower level. For example, an
index object is implemented using page objects.

To simplify the programs that access the higher-level objects, all the operations on
the objects in the higher level are made to appear instantaneous to one another. For
example, because of the page locks acquired by an index operation, index
operations appear to be instantaneous to one another even though an index
operation may access more than one page object. The logical locking in System R is
simplified because index operations can be treated as instantaneous. The atomicity
concept can be applied again to present this image of instantaneity. In other words,
there are two kinds of atomic computations in our implementations. The first kind of
atomic computations are the computations that we have been discussing in this
chapter. They access the higher-level objects and can last a long time. In System R,
they may be queries or updates to the database. The second kind of atomic
computations are the computations used to implement the operations on the
higher-level objects. They make the operations on the higher-level objects appear
instantaneous and simplify the programming of the first kind of atomic computations.
They are probably short. In System R, they last for the duration of an index
operation. To distinguish the two kinds of atomic computations, we call the first kind
globally atomic computations and the second locally atomic computations because
we expect in most applications the second kind will execute within a single site.
Figure 1-1 describes this paradigm of implementing globally atomic computations
with locally atomic computations.

[Figure labels: "Operations on higher-level objects, e.g., index objects";
"Operations on lower-level objects, e.g., page objects"]

Figure 1-1: A Globally Atomic Computation Implemented with
Locally Atomic Computations

The locally atomic computations are atomic in the sense that they make operations
on a higher-level object appear to be instantaneous to one another. On the other
hand, they are not globally atomic in the sense that after one of these locally atomic
computations (e.g., a1 in figure 1-1) is completed, its effects can be observed by
other locally atomic computations even though the globally atomic computation that
invokes it (e.g., a in figure 1-1) is not yet committed. For example, by releasing page
locks at the end of an index operation o, changes made by o on the page objects can
be observed by other index operations even when the globally atomic computation
that invoked o is not yet committed.



Using the calendar example, each mark operation in a long globally atomic
set_up_meeting computation can be executed as a short locally atomic computation.
Obviously, we need the equivalent of the logical locks in System R to make sure that
the collection of short locally atomic computations would appear to be a long globally
atomic computation. For example, a read_calendar operation must be prevented
from observing the effects of a mark operation if the result returned by read_calendar
is supposed to be accurate. This is because the set_up_meeting computation that
invoked the mark operations may be aborted later. The subject of logical locking will
be discussed in the next section.

By implementing operations on a higher-level object with locally atomic
computations, the programs that invoke these operations can treat them as
instantaneous regardless of the complexity of their implementations. The complexity
of synchronization and recovery is reduced by dividing them into two levels. For
example, synchronization is divided between the logical locks and the page locks in
System R. In our calendar example, a read_calendar operation would never observe
the state of a calendar with partially executed mark operations. We call this idea of
implementing long globally atomic computations with short locally atomic
computations level atomicity. A similar idea has been presented by Beeri in [5] and
Moss et al. in [42] although their work is not motivated by long atomic computations.
The difference between our work and theirs lies in the different approaches used to
implement logical locking.

1.3.2.2 Conflict Model 

In this section we briefly describe our solutions to the following two questions: 

1. How can the logical locking in System R be extended to different kinds of
abstract objects?

2. How can logical locking be extended to cover "logical timestamping"?

To answer these questions, we will generalize from the concurrency control
algorithms synchronizing objects with only read/write operations. Examining the
timestamp and locking algorithms, we can identify three common components of
these algorithms:

1. Determining how computations are serialized. It is determined by the
order in which computations commit in a locking algorithm, and by the 
timestamp order in a timestamp algorithm. 

2. Determining when a "conflict" arises. For example, in a locking
algorithm, a conflict arises for a read operation when it tries to acquire a
read lock and there is another incomplete computation holding a write
lock. In a timestamp algorithm, a conflict arises for a write operation
when there are previously executed read operations invoked by other
computations with larger timestamps.

3. Determining the action to take when a conflict arises. In a locking
algorithm, operations are delayed. In a timestamp algorithm, operations
are either restarted or delayed.

Programming the logical locking needed for any abstract object can follow the
pattern above. First, determining how computations are serialized can be achieved
with the following:

1. a concurrency control algorithm similar to the locking and timestamp
algorithms,

2. a programming interface from which an object implementation can
determine the serialization order of the computations that had invoked
operations on the object.

Second, when conflicts are created is application-dependent and depends on the
functionality of an object. For example, whether a read_calendar operation creates a
conflict depends on its functionality and, if it is required to return an accurate view of
the calendar, whether there are incomplete set_up_meeting computations that may be
serialized before it. In addition to capturing the serialization order, the programming
interface that we described above should also capture the history of previously
invoked operations and the status (e.g., incomplete, committed) of the computations
that invoked them. In the next section we will describe such a programming
interface. It allows an object implementation to express the conditions under which a
conflict arises.



These conditions are expressed in such a way that they are insensitive to whether a
locking or timestamp algorithm, or some other concurrency control algorithm, is
used to determine the serialization order. For example, the condition under which a
conflict arises for a write operation on a read/write object can be expressed as
follows: previously executed read operations invoked by other computations may be
serialized after this computation.

With a locking algorithm, this condition translates into the following condition: read
locks are held by other computations. With a timestamp algorithm, the equivalent
condition is: previously executed read operations invoked by other computations
have larger timestamps. Similarly, the condition under which a conflict arises for a
read operation is that there are previously executed write operations invoked by
other computations that are not committed or aborted and may be serialized before
this computation. Notice that we have hidden underneath these conditions the
choice of how to determine the serialization order.
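
The two conflict conditions for a read/write object can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the Computation record, the LockingOrder and TimestampOrder classes, and the may_follow and may_precede queries are names invented here, and treating an aborted computation as never serialized is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Computation:
    name: str
    timestamp: int
    status: str = "incomplete"   # "incomplete", "committed", or "aborted"


class LockingOrder:
    """Serialization follows commit order."""

    def may_follow(self, other, me):
        # An incomplete computation may still commit after `me`.
        return other.status == "incomplete"

    def may_precede(self, other, me):
        # Committed computations precede `me`; incomplete ones may.
        return other.status in ("incomplete", "committed")


class TimestampOrder:
    """Serialization follows timestamp order."""

    def may_follow(self, other, me):
        return other.status != "aborted" and other.timestamp > me.timestamp

    def may_precede(self, other, me):
        return other.status != "aborted" and other.timestamp < me.timestamp


def write_conflict(me, history, order):
    # Earlier reads by other computations that may be serialized after `me`.
    return any(op == "read" and comp != me and order.may_follow(comp, me)
               for comp, op in history)


def read_conflict(me, history, order):
    # Earlier writes by other, still incomplete computations that may be
    # serialized before `me`.
    return any(op == "write" and comp != me and comp.status == "incomplete"
               and order.may_precede(comp, me)
               for comp, op in history)
```

The functions write_conflict and read_conflict never mention locks or timestamps. Only the order object passed in knows how serialization is decided, which is the sense in which the conditions hide that choice.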

We will describe a process in which these conflict conditions can be systematically
derived from the functionality of an abstract object. The conflict conditions provide
an approximation of the level of concurrency that can be achieved with a certain
functionality.

Finally, the action that needs to be taken when a conflict arises depends on how the
serialization order is determined. For example, some algorithms require that an
operation be delayed whereas other algorithms require the computation that creates
a conflict to be restarted. Similar to the conflict conditions, these actions can be
expressed without exposing the underlying concurrency control algorithm.

1.3.2.3 Programming Interface

To implement the conflict model that we have described above, we provide a
programming interface that is characterized by the use of history objects. A history
object captures the history of operations that had been executed in an abstract
object. Queries can be directed to the history object to determine whether a conflict
condition is met. The interface of the history objects will make the underlying
concurrency control algorithm transparent to the application programmers.

When a conflict arises, some of the actions that can be taken are delaying or 
restarting a computation that is involved in the conflict. Again, these actions can be 
made transparent to the programmer and expressed in the programming interface as 
a generic resolve conflict statement. 

We will also discuss how recovery can be performed in our programming interface.
For example, if the execution of an operation changes only the state of a history
object, aborting a computation can be achieved by simply undoing the changes in
the history object. This is a simple action and can be automated easily.
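
A history object of this kind might look roughly like the following Python sketch. The class name, its methods, and the calendar query are hypothetical and serve only to illustrate recording, querying, and recovery by undoing changes to the history; the actual interface is the one described later in this thesis.

```python
class HistoryObject:
    """Hypothetical history object: a log of operations plus outcomes."""

    def __init__(self):
        self._log = []       # entries: (computation, operation, argument)
        self._status = {}    # computation -> "incomplete" or "committed"

    def record(self, comp, operation, argument):
        self._status.setdefault(comp, "incomplete")
        self._log.append((comp, operation, argument))

    def commit(self, comp):
        self._status[comp] = "committed"

    def abort(self, comp):
        # Recovery: undo the computation's changes by dropping its entries.
        self._log = [e for e in self._log if e[0] != comp]
        self._status.pop(comp, None)

    def entries(self, operation=None, status=None):
        # Query: entries matching an operation name and/or computation status.
        return [e for e in self._log
                if (operation is None or e[1] == operation)
                and (status is None or self._status[e[0]] == status)]


def reserved_slots(history):
    # The calendar state is derived entirely from the history object.
    return {slot for _, _, slot in history.entries(operation="mark")}
```

Because the calendar state in this sketch lives only in the history object, abort needs no application-specific code: dropping the aborted computation's entries is the whole recovery action.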

1.3.2.4 Concurrency Control Algorithms

Although we have provided a programming interface so that the programmer is
unaware of the underlying concurrency control algorithm, the system implementation
has to make a choice among the available options. The system implementation
should also provide the necessary translation from the programming interface to the
option chosen.

We have argued that in some applications the concurrency problem can only be
solved by changing the functionality of the application. It remains to be seen whether
the choice of the concurrency control algorithm affects the concurrency of a system
with long atomic computations significantly. We will argue that in some cases it does
make a difference. We will present some novel algorithms that minimize the
likelihood that costly conflicts will arise. For example, the cost of restarting a short
computation is much smaller than that of restarting a long computation.
Consequently, an algorithm that makes restarting long computations less likely
provides a higher level of concurrency.



1.3.3 Resilience Problem and Its Solutions

To increase the resilience of long computations, we propose a checkpoint
mechanism and the use of relay message servers. Each checkpoint specifies some
intermediate state of a computation; the state specified by the last checkpoint will be
restored after a transient failure and the computation will be restarted from that
checkpoint. In addition to limiting the effect of site crashes, checkpoints can also
serve as fire walls to limit the rollback due to deadlocks. Relay message servers
provide buffering and reliability when the network partitions frequently. Some other
systems [10, 19] also use reliable communication primitives to simplify the
implementation of distributed atomic computations. The relay message service in
this thesis is easier to implement because it does not provide any guarantees on the
order in which messages are delivered.
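
The restart-from-checkpoint behavior can be illustrated with a small Python simulation. It is a sketch under stated assumptions, not the mechanism of Chapter 7: StableStore stands in for stable memory, TransientFailure for a site crash, and the computation is simply a running sum over a list of steps.

```python
import copy


class TransientFailure(Exception):
    """Stands in for a site crash or similar transient failure."""


class StableStore:
    """Hypothetical stable storage holding the most recent checkpoint."""

    def __init__(self):
        self._saved = None

    def checkpoint(self, state):
        self._saved = copy.deepcopy(state)

    def restore(self):
        return copy.deepcopy(self._saved)


def run(steps, store, checkpoint_every, fail_before=()):
    """Sum `steps`, checkpointing every `checkpoint_every` steps.

    `fail_before` lists step indices at which one simulated failure occurs.
    Returns (total, number of step executions).
    """
    pending_failures = set(fail_before)
    state = {"next": 0, "total": 0}
    store.checkpoint(state)
    executions = 0
    while state["next"] < len(steps):
        try:
            i = state["next"]
            if i in pending_failures:
                pending_failures.remove(i)
                raise TransientFailure(i)
            state["total"] += steps[i]
            state["next"] = i + 1
            executions += 1
            if state["next"] % checkpoint_every == 0:
                store.checkpoint(state)
        except TransientFailure:
            # Volatile state is lost; resume from the last checkpoint.
            state = store.restore()
    return state["total"], executions
```

With five steps, a checkpoint every two steps, and one failure before step 3, only step 2 is repeated, so six step executions are needed in total; restarting from the beginning would have needed eight.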

1.4 Roadmap

Chapter 2 describes our model of system hardware and assumptions. In particular,
we do not assume a reliable communication network in which messages are not lost
and are delivered in a bounded time. We believe that implementing such a network is
prohibitively expensive and any upper bounds on delivery times would be so large as
to be useless. The hardware model is followed by a model of computation. Chapter
2 concludes with a more careful definition of atomicity.

Chapter 3 describes our conflict model and how functionality can be traded off for
concurrency. Chapter 4 describes our programming paradigms and presents
examples of application programs. Chapter 5 compares concurrency control
algorithms and argues that some algorithms would have better performance with
certain types of applications. We will also present two novel algorithms: a
hierarchical algorithm and a time-range algorithm. These algorithms minimize the
occurrences of costly conflicts under certain conditions. Chapter 6 shows that
atomicity is as powerful as some other correctness definitions [50, 38] in which
atomicity is abandoned and replaced with explicit descriptions of how computations
can interleave. In Chapter 7 we turn our attention to the resilience problem of long
computations. We will describe a checkpoint mechanism and the use of relay
message servers to buffer messages. Chapter 8 is the conclusion.

1.5 Related Work

In this section, we compare our work with related work on concurrency control and 
resilient computing. In our comparison of concurrency control, we focus on other 
systems that use application semantics to improve concurrency. Much work has 
been done in this area. Many proposals [e.g., 23, 24, 25, 5, 8] do not consider
recovery issues and will not be covered in this section. Comparison with related work
can also be found in the rest of this thesis as we describe more details of our
proposal. 

1.5.1 Predicate Locks 

Eswaran et al. [14] describe the use of predicate locks for a relational database
management system. An operation must acquire a predicate lock before it can
proceed. Two predicate locks conflict if a tuple in a relation satisfies both predicates.
Other than assuming a locking algorithm, the predicate locks differ from our conflict
conditions in that the unit of concurrency is limited to a tuple. For example, using
predicate locks does not solve the concurrency problem of our calendar application
if each slot is implemented as a tuple. There is also no obvious way in which a slot
can be broken into smaller units to increase concurrency.

1.5.2 Schwarz's Thesis

Schwarz [50] defines correctness as the acyclicity of computations with respect to a
set of dependency relations. A dependency between two computations is formed if
they each execute an operation at the same object. Correctness requires that the
dependency graph be acyclic. The dependency relations are parameterized by the
type of the operations invoked and the value of the arguments. Dependency
relations are "insignificant" and ignored in the dependency graph if the two
operations involved in the dependency commute. Serializability is viewed as a
special case in a range of possible correctness definitions with only the insignificant
dependency relations ignored. Less restrictive correctness definitions can be
obtained by leaving out "significant" dependency relations in the dependency graph.

The limitation of this approach is that the commutativity of two operations usually
depends on many factors. It depends not only on the types of the operations and their
arguments, but also on the history of operations invoked previously and the results
returned by operations. For example, whether an operation to withdraw money from
a bank account commutes with a previous withdraw operation depends on the
balance of the account and the responses to these withdraw operations
(insufficient_funds or okay). Whether an operation can proceed cannot in general be
determined by pairwise dependencies with previously invoked operations. In other
words, the limitation of Schwarz's approach is due to a static specification of the set
of dependency relations included in the dependency graph.
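
The dependence of commutativity on the account balance can be checked with a few lines of Python. The withdraw and commute functions below are illustrative assumptions, with "commute" taken to mean that both orders yield the same final balance and the same response for each withdrawal.

```python
def withdraw(balance, amount):
    """Return (new_balance, response) for a single withdraw operation."""
    if amount <= balance:
        return balance - amount, "okay"
    return balance, "insufficient_funds"


def commute(balance, a, b):
    """True if withdrawing `a` and `b` gives the same final balance and the
    same response to each withdrawal in either order."""
    after_a, resp_a_first = withdraw(balance, a)
    end_ab, resp_b_second = withdraw(after_a, b)
    after_b, resp_b_first = withdraw(balance, b)
    end_ba, resp_a_second = withdraw(after_b, a)
    return ((end_ab, resp_a_first, resp_b_second)
            == (end_ba, resp_a_second, resp_b_first))


plenty = commute(100, 30, 40)   # both succeed in either order
tight = commute(50, 30, 40)     # whichever runs second is refused
broke = commute(20, 30, 40)     # both are refused in either order
```

The same pair of withdraw operations commutes at balances of 100 and 20 but not at 50, so a dependency relation fixed in advance by operation type and arguments must treat the pair as conflicting in every case.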

1.5.3 Allchin's Thesis 

Allchin [2] describes several different mechanisms to synchronize concurrent
computations. One of them uses locks with user-defined lock modes. This approach
is similar to Schwarz's and suffers from the same limitations. Allchin also suggests
the use of a history mechanism similar to ours but tailored for a locking algorithm.
Recovery is supported with recoverable objects that return to their initial values when
a computation is aborted. The state of an implementation has to be carefully
encoded with recoverable objects. In general, the changes made to a recoverable
object by two computations will be lost if the computation that made the first changes
is aborted. The recovery paradigms discussed in this thesis are different in that an
application can invoke application-dependent recovery code explicitly. Two different
recovery paradigms will be discussed in this thesis. One of them allows
application-dependent code to be executed to perform state changes when a
computation commits. The other allows application-dependent code to be executed
when a computation aborts.

1.5.4 Weihl's Thesis

Our atomicity definition follows the work of Weihl [55]. Weihl describes two types of
objects called atomic and mutex objects. Mutex objects are in general locked for the
duration of an operation whereas atomic objects are locked until the end of a
computation. Two approaches, implicit and explicit, are suggested for
synchronization and recovery.

In the implicit approach, synchronization is achieved by testing whether an atomic
object was accessed by a still incomplete computation. Presumably the programmer
can set up enough atomic objects to encode the history information needed for
synchronization. For recovery, the programmer sets up the atomic objects so that
when a computation aborts, its effects are nullified by the atomic objects reverting to
their previous states. The effects of concurrent computations should not be undone
in the process. In the explicit approach, objects are associated with undo records or
intentions lists constructed explicitly by the programmer. The undo records or
intentions lists can be examined to determine whether an operation can proceed.
When a computation commits or aborts, the undo records or intentions lists are used
to determine the state changes that need to be made.

In the implicit approach, it is unclear how other types of concurrency control
algorithms can be employed because the lock testing of atomic objects exposes the
underlying algorithm. Although the explicit approach does not exclude using other
concurrency control algorithms, it does not provide an interface that makes the
concurrency control algorithm transparent.



1.5.5 Garcia-Molina's Semantic Consistency

Garcia-Molina [38] describes a system in which computations are divided into steps
and counter-steps. The counter-steps undo the previous steps if the computation is
aborted. Two steps can proceed concurrently if they are "compatible" according to
the compatibility sets of the computations to which they belong. A compatibility set is
determined by the type of a computation and consists of sets of other types of
computations that can interleave with this type. The limitation of the compatibility
sets is similar to that of the dependency relations in Schwarz's thesis [50]. Since the
compatibility sets are defined statically, there are a large number of applications in
which two computations are defined to be incompatible because they are
incompatible for a small class of situations. It is also unclear how an application
programmer can describe the behavior of an implementation in a high-level abstract
specification. The compatibility sets and counter-steps are rather
implementation-oriented descriptions of the behavior.

1.5.6 Montgomery's Thesis

Montgomery [39] describes the use of polyvalues to represent the values of data
objects accessed by incomplete computations. Each polyvalue represents the
possible values that the object may take on depending on the outcomes of the
concurrent computations. It deals with the problem of failure atomicity but not
serializability, because two computations can access two objects in different orders
and both commit.

1.5.7 Gifford's Persistent Actions

Gifford and Donahue [15] describe executing a computation as a persistent action. A
persistent action consists of atomic actions and other persistent actions. Atomic
actions in [15] can be equated with the atomic computations in this thesis. The
results returned by the component actions of a persistent action are logged in stable
memory. When a persistent action is interrupted by a site crash, it is restarted from
the beginning. When it invokes a component action that had its result logged, the
result can be reused instead of calling the component action again. Any
non-idempotent operations, such as reading the time-of-the-day clock, have to be cast as
component actions. The component actions of two persistent actions can interleave
arbitrarily.

In the system described in [15], it is unclear how abstract specifications of the 
behavior of persistent actions can be provided. Another difference between our 
work and theirs is our emphasis on how application-dependent synchronization and
recovery can be programmed. 

Our approach to resilience is also different. Instead of requiring the operations
executed in a persistent action to be either idempotent or cast as a component
action, the operations executed by the atomic computations in this thesis can be
non-deterministic. A careful structuring of idempotent actions is not necessary.
Checkpoints are specified explicitly. Stable memory access is necessary only at
checkpoints instead of whenever a component action returns.

1.5.8 Sha's Thesis

Sha [51] describes a system in which data objects are partitioned into atomic data
sets. Consistency constraints in the system cannot span atomic data sets. A
computation is called a compound transaction, which is subdivided into
consistency-preserving elementary transactions. The elementary transactions are further
subdivided into atomic commit segments, each of which accesses a different atomic
data set. When an atomic commit segment is finished, locks acquired to assure
serializability are released, but write locks are retained to guarantee failure atomicity.
When an elementary transaction is finished, the write locks are released and
recovery is achieved through compensating transactions.

The atomic data sets provide a relatively coarse-grained concurrency control. Two
data objects have to belong to the same atomic data set as long as there is at least
one consistency constraint relating them. Furthermore, Sha's approach does not
take into consideration the semantics of the consistency constraint itself. Weakening
a constraint does not increase the concurrency of a system unless the data objects
can be divided into smaller atomic data sets as a result.

To increase the resilience of a compound transaction, Sha suggests storing the
values of local variables in stable memory at the end of each atomic commit segment.
Our approach is different in that a computation can save a portion of its local state
selectively. Also, we describe how a computation can save its state when part of the
state may be accessed by other computations concurrently.

1.5.9 Miscellaneous

Other researchers [41, 53] have suggested the use of checkpoints to increase the
resilience of a computation. Our work is similar to theirs but is motivated by
computations that experience long communication delays. As a result, we
emphasize how a caller of a remote program can checkpoint in response to, or in
anticipation of, long communication delays. To avoid restarting the remote program
that is expected to return after a long delay, the calling program should probably
checkpoint at the remote call. Mechanisms are also provided to allow the calling
program and other ancestor programs to checkpoint if an unexpected delay arises. It
seems that in [41, 53] a computation checkpoints the entire state accessible to it,
whereas we expect programmers to specify explicitly a portion of the computation
state to be preserved.

Another approach to improving resilience is by replicating processors and memory,
such as in Tandem and Auragen [3, 13]. These systems consist of a collection of
logical processes. Each logical process is implemented by two physical processes,
one primary and one secondary, on two processors. In the Auragen system, the
messages received by a logical process are automatically checkpointed by the
system in the memory of a secondary processor. The secondary processor can take
over by re-processing the messages to bring its memory up-to-date. Any
non-deterministic processing, such as reading the time-of-the-day clock, has to be
cast as another logical process, communicating with this process through messages.
The application is not aware of the checkpointing except for management duties, such
as choosing the processors for the process pair. In the Tandem system, any state
change in the primary processor is checkpointed on the secondary processor. Our
checkpoint mechanism is more economical because it assumes only the availability
of some permanent memory. It is not always possible to have an available secondary
processor to process the checkpoint messages. A site may be disconnected from
the rest of the system and the cost of a secondary processor may be too high for
some applications.

Replication also provides a limited solution to the concurrency problem. By
replicating objects [16, 20], computations can access nearby replicas and long
communication delays can be avoided. Unfortunately, replication has its drawbacks.
First, it is expensive. When objects are replicated, constraints are imposed on
accesses of the objects to ensure consistency. For example, if an object can be read
with any one of the replicas, all replicas have to be written when the object is
updated. There is also the cost of extra storage. Second, replication does not
eliminate all long computations. In the read-one-write-all rule described above, read
accesses can be serviced readily as long as there is a replica nearby. The availability
of write accesses is decreased, however. The length of a computation that performs
updates is actually increased by replication.

Another limited solution to the concurrency and resilience problems is to abort and
retry a computation when it cannot be completed quickly. This is unacceptable as a
general solution for the following reasons:

- Previous work is wasted.

- If the system does not retry the computation automatically, the user has
to retry manually.

- The computation is likely to take longer to complete than if it were
allowed to suspend and wait for communication problems to disappear.
In fact, when the computation involves many sites and the network
partitions frequently or extensively, the computation is unlikely to be
completed without encountering significant communication delay.

- The deferral of the entire computation due to the unavailability of several
sites may be unacceptable. For instance, it is undesirable to abort a
computation that sets up a meeting among many personal calendars
because a few of them are unavailable. Also, the likelihood of setting up
the meeting successfully decreases with the passage of time. The
proposed meeting, though it may be tentative, is prevented from
appearing in the available calendars. Abandoning the unavailable
participants and declaring the computation completed is also not the
most appropriate behavior.




Chapter Two 
System Model 



In this chapter we give an account of a system model to prepare for discussion in
later chapters. We start in section 2.1 by describing the hardware abstractions on
which the distributed systems considered in this thesis are based. In section 2.2, we
present a higher level view of these systems and describe how activities inside them
can be modelled. Then, in section 2.3, we give a definition for atomicity based on the
model.

2.1 Physical Environment and Assumptions 

In this dissertation, a distributed system is viewed as a collection of machines
connected by a communication network. We call the machines sites; they can be any
type of machine, ranging from portable computers to mainframes or large
multiprocessor machines. Sites can be added to or removed from the system
dynamically. A site can send messages through the network to communicate with
other sites. Messages may be lost, duplicated, delayed for an arbitrary period of time,
or arrive out of order, but garbled messages will be discarded. In particular,
messages can be delayed for an arbitrary period of time because the communicating
sites are partitioned. We assume, however, that partitioned sites will be able to
communicate eventually. We will not attempt to handle Byzantine failures: the sites in
the system are assumed to be cooperative, and redundant bits can be added to
packets in the network to keep the probability of undetected garbled messages
arbitrarily low.

Each site possesses both volatile and stable memory.* A site also possesses one or
more fail-stop processors: a processor may crash at any moment, but when it
crashes, it immediately stops all processing before sending any erroneous messages
or corrupting its site's stable memory. The implementation of fail-stop processors
from unreliable hardware is beyond the scope of this thesis. See [49] for a discussion
of the subject. We assume that all crashed sites will recover eventually. When a site
recovers, it loses the content of its volatile memory but preserves that of its stable
memory.

* This is not strictly necessary. Sites without stable memory can employ remote
stable storage servers.

When a site sends a message to another site, it may expect a response. If none
arrives after a long time, it may be because:

- the original message is lost or still on its way, or

- the response message is lost or still on its way, or

- the two sites are partitioned, or

- the responding site is crashed, or

- the responding site is not ready to send the response.

We do not assume that the sender can differentiate among all these cases. 

2.2 Model of Computation 

At a higher level than the hardware abstractions described above, a system can be
viewed as a collection of objects. For example, there may be objects controlling
access to personal calendars, and objects acting as printer spoolers. An object may
reside at one site or may be distributed among many sites. Each object supplies
several operation types; for example, a personal calendar object can support a mark
operation and a delete operation. Arguments can be passed when an operation is
invoked. Results can be returned with an operation. For instance, a time duration
and a purpose can be passed to mark as arguments. Mark can return either okay or
slot_filled.

Computations are the units of work in a system. Inside a computation, operations on
different objects can be invoked. A computation can span multiple sites.
Computations are atomic and serve as units for synchronization and recovery.
Atomicity, defined more carefully in section 2.3.3, guarantees that the system
behaves as if the computations were executed serially and each computation were
executed either in entirety or not at all.

To provide a finer-grained unit in synchronization and recovery, a computation is 
decomposed into a nested tree of actions [40, 48, 34]. Actions are divided into 
top-level actions and sub-actions. A computation is associated with a single top-level 
action. The boundaries of a computation coincide with those of its top-level action. A
top-level action can create sub-actions and sub-actions can in turn create their own
sub-actions. Operations are executed within an action; they must start and finish
within the same action. A parent action can create several sub-actions in parallel, 
but the sub-actions will appear to have executed serially within the parent action. A 
parent action can also abort a sub-action without abandoning the work performed in 
the rest of itself. An aborted action should appear never to have been executed. 

Frequently, a computation creates a sub-action to execute an operation so that the 
effects of that operation can be undone by aborting the sub-action. However, an 
action should be distinguished from an operation because the former, like a 
computation, is merely a mechanism to define a unit of synchronization and 
recovery. It is not associated with any object. 

Aborts of an action may be caused by hardware failures such as site crashes or
communication failures. For example, the creator of an action can decide to abort
the action if the latter is executed on a remote site and, due to communication
failures, the creator cannot determine whether the action has terminated. Aborts can
also be initiated by an application program in the absence of hardware failures. For
example, an action that executes a mark operation in a setup_meeting computation
can be aborted if too few participants can attend. Depending on the concurrency
control algorithm used in a system, an action can also be aborted because of
deadlocks. When an action is aborted, all its sub-actions are aborted. A
computation is aborted when its top-level action is aborted. In general, we will use the
same terminology to refer to an action and the operations that are executed within it:
we say an operation is aborted when the action in which it is executed is aborted.

A computation, its nested actions (excluding those aborted), and the operations
executed within these actions are committed when the top-level action terminates
successfully. Committed computations, actions, or operations cannot be aborted. A
computation, action, or operation is finalized when it is committed or aborted.
Otherwise it is tentative. The outcome of a computation, action, or operation is
determined when it is finalized. A nested action that has terminated is still considered
tentative as long as the top-level action is incomplete. See
figure 2-1 for the possible states that a computation, action, or operation can go
through.




Figure 2-1: States of a Computation/Action/Operation
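
The states named above (tentative, committed, aborted) and the abort rules for
nested actions can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the class Action, the enumeration Status,
and the numeric child identifiers (a.1, a.2) are hypothetical names that do not
appear in this thesis, and synchronization, recovery, and distribution are ignored.

```python
from enum import Enum


class Status(Enum):
    TENTATIVE = "tentative"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Action:
    """A node in a nested tree of actions (hypothetical illustration)."""

    def __init__(self, ident, parent=None):
        self.ident = ident
        self.parent = parent
        self.children = []
        self.status = Status.TENTATIVE
        self.terminated = False  # a terminated sub-action is still tentative

    def create_sub_action(self):
        # Child identifiers extend the parent's identifier, as in a.b...m.n.
        child = Action(f"{self.ident}.{len(self.children) + 1}", parent=self)
        self.children.append(child)
        return child

    def terminate(self):
        # The action has finished its work; its outcome is not yet determined.
        self.terminated = True

    def abort(self):
        if self.status is Status.COMMITTED:
            raise ValueError("a committed action cannot be aborted")
        self.status = Status.ABORTED
        for child in self.children:  # aborting an action aborts its sub-actions
            if child.status is not Status.ABORTED:
                child.abort()

    def commit_top_level(self):
        if self.parent is not None:
            raise ValueError("only a top-level action commits the computation")
        if self.status is not Status.TENTATIVE:
            raise ValueError("the action is already finalized")
        self._commit()

    def _commit(self):
        # Commit this action and every descendant that has not been aborted.
        self.status = Status.COMMITTED
        for child in self.children:
            if child.status is Status.TENTATIVE:
                child._commit()

    @property
    def finalized(self):
        return self.status is not Status.TENTATIVE
```

For example, if a top-level action a creates a.1 and a.2 and then aborts a.2,
committing a commits a.1 but leaves a.2 (and anything a.2 created) aborted.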




2.3 Atomicity 

In this section we will give a more careful definition of the behavior of a system in 
which computations are atomic. Our goal is to define atomicity without constraining 
the system implementations unnecessarily. The definition will be stated only in terms 
of the observable behavior of a distributed system. More importantly, the observable
behavior of a system will be cast in terms of the behavior of abstract objects with
abstract operations instead of the behavior of objects with read/write operations.
Using abstract objects in our definition allows atomicity to depend on the 
functionality of these abstract objects. Our definition is similar to that in [55] except 
that ours covers nested actions. 

We will describe our atomicity definition in three steps. First, we will describe an
event model, which models the externally visible activities that happen at the
interface of an abstract object with events. The activities in a distributed system are
modelled with a sequence of events, which we call a history. The events in a history
can be generated by different computations. Since the model does not include the 
details of how an object manipulates its internal state, the implementation of the 
object is not constrained to a particular type of implementation. 

Second, we will describe how applications can define their functionality by specifying 
serial specifications for the objects in a system. These serial specifications are
similar to the specifications that are usually used to define the semantics of abstract
data types [32]. They specify a set of states that an object can be in, and a set of
operations that may cause a state transition. Pre-conditions on the state can be
attached to the operations. 

Third, since a computation can be modelled as a sequence of events, we will define
the behavior of a system which executes computations atomically as a set of atomic
histories. Informally, a history is atomic if it is "equivalent" to an acceptable "serial
history." The set of acceptable serial histories is defined collectively by the serial 
specifications. 




Section 2.3.1 describes the event model. Section 2.3.2 illustrates how a serial 
specification can be expressed conveniently with a state machine. The state 
machines help us capture the semantics of the example applications in later 
discussions more succinctly. As introducing a formal specification language is 
beyond the scope of this thesis, we will use informal notations to represent the state 
machines. Section 2.3.3 defines atomic histories with the event model and the serial 
specifications. 

2.3.1 Event Model 

In our event model, an event occurs when an operation is invoked or returned, or
when an object is informed of the outcome of an action in which an operation of that
object is executed.* Each event identifies the object and action that are involved with
unique object identifiers and action identifiers. In this thesis, action identifiers are of
the form a.b...m.n where a.b...m is the identifier of the parent action of a.b...m.n.
There are four types of events in the model:

invoke events: <operation_type_name(arguments), Object_ID, Action_ID>

        The named operation type is invoked at Object_ID. Action_ID is
        the unique identifier of the action in which the operation is
        executed.

return events: <result_type_name(results), Object_ID, Action_ID>

        Object_ID returns the result of an operation invoked previously.

commit events: <commit, Object_ID, Action_ID>

        Object_ID is informed that the action identified by Action_ID is
        committed.

abort events: <abort, Object_ID, Action_ID>

        Object_ID is informed that the action identified by Action_ID is
        aborted.

To simplify our notation, we assume that an action can only invoke an operation after
the result to a previous operation is returned. Parallelism within an action can be
achieved with parallel sub-actions. The invoke and return events of an action can be
paired in the obvious way.

* We will ignore I/O operations in our model although they are externally visible.

To illustrate the event model, suppose r1 and r2 are personal calendar objects, each
providing a mark operation to reserve a slot in the calendar. Further suppose an
implementation of setup_meeting that creates sub-actions to execute the individual
mark operations in the participating personal calendar objects. The following
sequence of events may be observed when a user tries to set up a meeting between
r1 and r2 in a top-level action a.

        <mark(time, description_of_meeting), r1, a.b>
        <okay, r1, a.b>
        <mark(time, description_of_meeting), r2, a.c>
        <okay, r2, a.c>
        <commit, r1, a.b>
        <commit, r2, a.c>

or the following may happen, where d is another action:

        <mark(time, description_of_meeting), r1, a.b>
        <okay, r1, a.b>
        <mark(time, some_other_business), r2, d>
        <okay, r2, d>
        <commit, r2, d>
        <mark(time, description_of_meeting), r2, a.c>
        <slot_filled, r2, a.c>
        <abort, r1, a.b>*

Obviously, not every sequence of events is "well-formed." For example, a sequence
of events should not have a commit event and an abort event for the same action.
We will leave a more formal definition of well-formed sequences until Chapter
6 where we construct proofs using the event model. Meanwhile, we assume all the
event sequences are well-formed in the sense that they represent some "reasonable"
behavior of an implementation and call them histories.

* We have left the outcome of a.c unspecified in this example. However, it makes
little difference at r2.
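
To make the event notation concrete, the sketch below encodes events as Python
records and checks the single well-formedness condition mentioned above (no
action has both a commit and an abort event). The names Event, parent_of, and
has_conflicting_outcomes are hypothetical and are not taken from this thesis;
operation arguments are omitted for brevity, and the full definition of
well-formed sequences is not attempted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "invoke", "return", "commit", or "abort"
    name: str    # operation or result name; empty for commit and abort events
    obj: str     # object identifier, e.g. "r1"
    action: str  # dotted action identifier, e.g. "a.b"


def parent_of(action_id):
    """Return the identifier of the parent action, or None for a top-level action."""
    head, sep, _ = action_id.rpartition(".")
    return head if sep else None


def has_conflicting_outcomes(events):
    """True if some action has both a commit event and an abort event."""
    committed = {e.action for e in events if e.kind == "commit"}
    aborted = {e.action for e in events if e.kind == "abort"}
    return bool(committed & aborted)


# The second event sequence of the calendar example.
history = [
    Event("invoke", "mark", "r1", "a.b"),
    Event("return", "okay", "r1", "a.b"),
    Event("invoke", "mark", "r2", "d"),
    Event("return", "okay", "r2", "d"),
    Event("commit", "", "r2", "d"),
    Event("invoke", "mark", "r2", "a.c"),
    Event("return", "slot_filled", "r2", "a.c"),
    Event("abort", "", "r1", "a.b"),
]
```

Appending a commit event for a.b to this sequence would make
has_conflicting_outcomes return True, so the result could not be a history.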




2.3.2 State Machines 

The serial specification of an abstract object can be defined with a state machine. 
Intuitively, a state machine defines the abstract states that the object "passes 
through" as individual events are "processed." This section describes how a state 
machine is specified and gives an example. 

A state machine for an object r_i has four components: S_i, I_i, T_i, and N_i. S_i is the set
of possible states of the state machine. I_i is the initial state. T_i is the set of
transitions; it corresponds to the set of possible invoke and return event pairs, since
not only the invoke event, but also the result that has been returned, determines how
the state is to be changed. N_i is a partial function which determines how and under
what conditions the state machine would change its state. It takes two inputs: a
"before" state and a transition, and returns an "after" state.

N_i can be extended in the following way to accept a sequence of transitions as its
second input:

        N_i : S_i × T_i* → S_i   (a partial function)
        such that
        N_i(s, <>) = s,
        N_i(s, t_seq || t) = N_i(N_i(s, t_seq), t)   if N_i(s, t_seq) ≠ ⊥
                           = ⊥                       otherwise
        where <> is the empty sequence, s ∈ S_i, t ∈ T_i, and t_seq ∈ T_i*.

The partiality of N_i can be used to exclude undesirable transition sequences from the
object. In other words, a serial specification can be viewed as defining a set of
acceptable transition sequences.

Suppose r_i is an object representing a set of integers. It supports three operations:
insert, delete, and member. Each operation takes an integer as an argument. Insert
adds the integer to the set and returns okay. Delete deletes the integer from the set if
the integer is in the set and returns okay in any case. Member returns a boolean
depending on whether the integer is an element of the set. The serial specification of
this set object is defined in figure 2-2. Abbreviations of the form op.arg.result will be
used for the transition <op(arg), r_i, a><result, r_i, a>.

S_i: sets of integers

I_i: ∅

T_i: insert.x.okay = <insert(x), r_i, a><okay, r_i, a>
     delete.x.okay = <delete(x), r_i, a><okay, r_i, a>
     member.x.b    = <member(x), r_i, a><b, r_i, a>
     where x is an integer, b is a boolean

N_i(s, insert.x.okay) = s ∪ {x}
N_i(s, delete.x.okay) = s - {x}
N_i(s, member.x.b)    = s   if (x ∈ s and b = true) or (x ∉ s and b = false)

Figure 2-2: A State Machine for a Set



In figure 2-2, the object starts with an empty set as its initial state. Three kinds of
transitions are possible. Each kind of transition changes the state in the obvious
way. Notice that N_i is defined only under the condition (x ∈ s and b = true) or (x ∉ s
and b = false) for the state s and the transition member.x.b. For example, a
sequence of transitions in which an insert.x.okay transition is followed immediately
by a member.x.false transition would be undefined with respect to N_i and hence
unacceptable.
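
The set state machine and the extension of the transition function to sequences
can be rendered as a small executable sketch. The encoding is an assumption made
only for illustration: a transition op.arg.result is written as the tuple
(op, arg, result), an undefined value of the partial function is represented by
None, and the function names next_state and run are hypothetical.

```python
def next_state(state, transition):
    """Transition function of the set object; None where it is undefined."""
    op, x, result = transition
    if op == "insert" and result == "okay":
        return state | {x}
    if op == "delete" and result == "okay":
        return state - {x}
    if op == "member" and result == (x in state):
        return state  # member never changes the state
    return None


def run(state, transitions):
    """Apply a sequence of transitions from left to right.

    Returns the final state, or None as soon as one step is undefined,
    mirroring the extension of the transition function to sequences.
    """
    for t in transitions:
        state = next_state(state, t)
        if state is None:
            return None
    return state


EMPTY = frozenset()  # the initial state of the set object
```

With this encoding, run(EMPTY, [("insert", 3, "okay"), ("member", 3, False)])
yields None: the sequence is not acceptable, as noted in the text above.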

We have introduced the terms "event" and "transition" in this section. Each of them
denotes something similar to an operation. The execution of an operation can be
viewed as the generation of an invoke event and a return event, or as the generation
of a transition. Since different results can be returned by an operation, different
transitions may be generated by the execution of an operation. For example, the
member(x) operation generates either a member.x.true or a member.x.false 
transition. 




2.3.3 Atomic Histories 

In this section we will combine the event model and serial specifications to define a 
set of atomic histories. First, we will define what a serial history is. Second, we will 
describe how a set of acceptable serial histories can be defined using the serial
specifications. Finally, we will define when a history is equivalent to a serial history. 
An atomic history is a history that is equivalent to an acceptable serial history. Again, 
we will rely on informal descriptions and leave a more formal notation until Chapter 6. 

A serial history is a history in which events from different actions are not interleaved,
an invoke event is always paired with a return event, and only invoke and return
events exist. The events in a serial history are ordered by a linearization, which can
be defined as a total ordering between every pair of sibling actions [34]. As a special
case, the top-level actions can be considered as sibling actions. An action b is
subsequent to a according to a linearization L if either b or one of b's ancestors is
after a or one of a's ancestors in L. An action a is prior to another action b if and
only if b is subsequent to a.*

Ideally, this prior/subsequent relationship should be extended to the operations
executed in two actions in the obvious way. However, because more than one
operation may be invoked at the same object by the same action or by actions that
bear an ancestor-descendant relationship, the following more complicated definition
is needed. An operation a is prior to another operation b at the same object
according to a serial history sh if:

1. the action in which a is executed is prior to that of b according to the
linearization of sh, or

2. a and b are executed in the same action and a is executed before b in
sh, or

3. the actions that a and b are executed in bear an ancestor-descendant
relationship and a is executed before b in sh.

An operation a is subsequent to another b if and only if b is prior to a. This definition
is well-formed because we assume that an action can execute only one operation at
a time and a parent action cannot invoke any operation while a child action is not
terminated.

* We assume that there are linguistic mechanisms for the application programmer to
  express the desired linearization constraints among sibling actions. For example, if
  b is created after a by the same parent action, then naturally b should be subsequent
  to a. In the rest of this thesis, we only consider linearizations that conform to these
  constraints. Occasionally, an action will create parallel sub-actions and the order
  among them is left unspecified by the application. Any total ordering will be
  acceptable in those cases.

  We do not provide any facility for the users to constrain the order among the
  top-level actions except a guarantee of external consistency. If a linearization is
  externally consistent, a computation a is ordered after another computation b if a is
  begun after b is completed and the completion of b is communicated to the human
  user of a either externally (outside the system) or internally (through messages sent
  and received by the sites in the system).
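
The prior/subsequent relationship between actions can be sketched as follows. The
representation is an assumption made for illustration only: a linearization is
given as a dictionary that maps each parent identifier to the ordered list of its
children, with the empty string standing for a virtual parent of all top-level
actions. The function names are hypothetical.

```python
def ancestors_and_self(action_id):
    """Return ["a", "a.b", "a.b.c"] for the identifier "a.b.c"."""
    parts = action_id.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def is_prior(a, b, order):
    """True if action a is prior to action b under the sibling orderings in order.

    The two actions are compared through the first pair of distinct ancestors
    (or the actions themselves) on their paths; that pair is always a pair of
    siblings, so the linearization orders it.
    """
    for x, y in zip(ancestors_and_self(a), ancestors_and_self(b)):
        if x != y:
            parent = x.rpartition(".")[0]  # "" stands for the virtual root
            siblings = order[parent]
            return siblings.index(x) < siblings.index(y)
    # Same action, or one action is an ancestor of the other: not ordered here.
    return False


# Example linearization: top-level action d before a; within a, a.b before a.c.
order = {"": ["d", "a"], "a": ["a.b", "a.c"]}
```

Under this linearization d is prior to a.c because d precedes a, the ancestor of
a.c, among the top-level actions. Operations executed in actions that bear an
ancestor-descendant relationship are ordered by clause 3 of the definition above,
which this sketch does not cover.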

We define a serial history sh to be acceptable if, by partitioning sh according to the 
object that an event is associated with, each of the sub-histories is an acceptable 
transition sequence according to the serial specification of the object associated with 
that sub-history. 

Finally, a history h is equivalent to a serial history sh if h is identical to sh after all but
the committed invoke and return events are removed from h and the events left
behind are rearranged according to the linearization of sh. A history is atomic if it is
equivalent to an acceptable serial history. A system is correct if it generates only
atomic histories. The linearization of sh is called a serialization order. By excluding
all but the committed events from a history h, we formalize the requirement on failure
atomicity. By requiring h to be equivalent to a serial history in which events are not
interleaved, we formalize the requirement on serializability.

Notice that our definition is different from some other atomicity definitions [46, 1]. In
these definitions, an atomic history is defined as equivalent to a serial history if the
two histories both cause the objects in the histories to reach the same states. Our
definition requires that an atomic history has the same external behavior as a serial
history. Our requirement is sufficient as a user cannot determine the state of an
object except through observing its visible behavior. For example, a bank customer
does not care about the internal state of a bank account object as long as he can
withdraw what is in his account and the balance on a monthly report is not less than
expected. Our definition also has the advantage that we do not have to define the
states that the objects will be in after executing a possibly non-serial history.

The major advantage of our atomicity definition, however, lies in its ability to 
incorporate serial specifications of abstract objects. If serial specifications are 
relaxed to enlarge the set of acceptable serial histories, the set of atomic histories is 
also enlarged and the system becomes more concurrent, provided an 
implementation can utilize the relaxed semantics. Thus concurrency is increased 
without sacrificing the simplicity offered by atomicity. 




Chapter Three 
Using Application Semantics 



In this chapter we describe the increase of concurrency that can be achieved 
through the use of application semantics in an implementation. To avoid being 
encumbered by excessive implementation details, we ignore how the implementation 
is actually programmed in this chapter. Instead, we assume an idealized 
implementation that would illustrate how concurrency can be improved when 
compared to an implementation that, say, uses read/write locks and 2-phase locking.
We will describe how the idealized implementation can be approximated by a
practical implementation in Chapter 4. The concurrency level afforded by the
idealized implementation is only an approximation of the actual concurrency level of 
a practical implementation. We will argue why it is a useful approximation later in the 
chapter. 

Our idealized implementation consists of multiple program modules, each
implementing an abstract object. We assume that a program module has encoded a
history of previously invoked operations and that the history information can be
retrieved. Each of the objects* has an associated queue of requests to invoke
operations at that object. These requests are issued by computations running in the
system. An object executes by taking a request from its queue, examining the
request and the history of previous operations, and determining whether a result can
be returned for the requested operation. A result can be returned when an object
can guarantee that only atomic histories are generated.

If a result can be returned, the request and the result will be added to the object's
history. Otherwise, a conflict is created and we assume that some action will be
taken against the request or the computation that issues the request. We leave these
actions unspecified for the moment, since our purpose is to evaluate the
concurrency of the implementation, which can be measured by how often a result
can be returned to a request. In an actual implementation, the operation may be
delayed or the computation that invokes the operation may be restarted when a
conflict occurs. Thus, how often a conflict arises is a realistic measure of
concurrency. We assume that an object can process a request instantaneously.
Details such as how the internal state of an object is encoded and how recovery is
performed will be left unspecified. However, we do assume that an object will learn
of the outcomes of computations eventually.

* We will use the word "object" to refer to the program module implementing the object.

In order to illustrate how application semantics improves the concurrency of the 
idealized implementation, we will describe a conflict model, which is one of the 
contributions of this thesis. The conflict model allows a programmer to determine the
condition under which a conflict is created based on the serial specification of the
object. We call this condition a conflict condition. The model is useful in that it
abstracts away the details of the concurrency control algorithm underneath. A
conflict condition will remain the same regardless of whether the abstract objects in a
system use timestamps assigned at the beginning of execution, or the order in which
computations commit, to determine a serialization order. Conflict conditions can
serve as a guide when serial specifications are designed, so that concurrency can be
traded off against functionality.

In section 3.1 we describe our conflict model. In section 3.2 we use a bank account
object to illustrate how conflict conditions can be derived and how concurrency is
improved when compared to an implementation that uses, say, read/write locks and
2-phase locking. A bank account example is used in this chapter to facilitate
comparison with other work. In section 3.3 we discuss how conflict conditions can
be derived for any abstract object. Because the practical implementations that will
be described in Chapter 4 approximate the idealized implementation closely, the
process of deriving conflict conditions is also helpful to a programmer writing the
practical implementations. In section 3.4 we describe how concurrency can be
increased by relaxing the serial specification of an object. Relaxing the serial
specification of an object makes conflicts less likely to arise. Using several
examples, we will illustrate that there are interesting classes of applications in which
the trade-off between concurrency and functionality can be usefully employed. In
Chapter 6 we will show that this approach of increasing concurrency is as powerful
as other correctness definitions that abandon atomicity [50, 38].

3.1 Conflict Model 

This section describes our conflict model and defines conflicts more carefully. We
show how the requirement of generating only atomic histories can be translated into 
a requirement of detecting conflicts. 

3.1.1 Generating Atomic Histories

To ensure that only atomic histories are generated by our idealized implementation,
the objects in the implementation must guarantee that any history generated will be
equivalent to some acceptable serial history. To provide this guarantee, the objects
must agree on a particular serialization order, which, in an actual implementation,
may be determined by the timestamps that are assigned at the beginning of
execution, or by the order in which computations commit. How this serialization
order is arrived at in an actual implementation depends on the concurrency control
algorithm and is the subject of Chapter 5. We refer to this serialization order
determined by the concurrency control algorithm as the serialization order of the
system. We assume that this is what is referred to when we speak about the
serialization order among operations.




3.1.2 Guaranteeing Equivalence to Serial Histories

To ensure that the history generated by the implementation is equivalent to an 
acceptable serial history defined by the serialization order, each object must ensure 
that the committed events involving itself, after being rearranged according to the 
serialization order, will be an acceptable transition sequence according to the 
object's serial specification. More informally, each object must make sure that the 
transitions that it generates are part of an acceptable serial history defined by the
serialization order. We say that an object exhibits atomic behavior when this is 
satisfied. 

For example, consider a bank account object r_i with a serial specification described
by the state machine in figure 3-1. To simplify our example, we assume the state of
the bank account contains only its balance, which can be represented with a real
number. The account object has three types of operations: deposit, withdraw, and
read_balance. The first two take a real number as an argument. Deposit increments
the balance by the amount indicated in the argument and returns okay. Withdraw
decrements the balance by the amount indicated in the argument and returns okay if
the balance is large enough to cover the withdrawal. Otherwise it returns
insufficient_funds. Read_balance returns the balance.

S_i: real numbers

I_i: 0

T_i: <deposit(x), r_i, a><okay, r_i, a> = deposit.x.okay
     <withdraw(x), r_i, a><okay, r_i, a> = withdraw.x.okay
     <withdraw(x), r_i, a><insufficient_funds, r_i, a> = withdraw.x.insuf
     <read_balance(), r_i, a><x, r_i, a> = read.x

where a is an action, x is a positive real number.

N_i(s, deposit.x.okay) = s + x
N_i(s, withdraw.x.okay) = s - x if s ≥ x
N_i(s, withdraw.x.insuf) = s if s < x
N_i(s, read.x) = s if s = x

Figure 3-1: A State Machine for a Bank Account Object






Suppose the history depicted in figure 3-2(a) has action a serialized before action b.
Because the transition sequence deposit.40.okay, read.balance.60 depicted in
figure 3-2(b) is not a member of the set of acceptable transition sequences defined by
the state machine in figure 3-1, the history in figure 3-2(a) is not atomic, and hence
the bank account object that generates the history in figure 3-2(a) does not exhibit
atomic behavior.

<deposit(40), r_i, a>             <deposit(40), r_i, a>
<okay, r_i, a>                    <okay, r_i, a>
<deposit(20), r_i, c>             <read_balance(), r_i, b>
<okay, r_i, c>                    <60, r_i, b>
<read_balance(), r_i, b>
<60, r_i, b>
<abort, r_i, c>
<commit, r_i, b>
<commit, r_i, a>

(a)                               (b)

Figure 3-2: A History and a Transition Sequence
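The check behind this example can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the original text: transitions are modelled as (operation, argument, result) tuples, the names next_state and acceptable are hypothetical, and None stands for an undefined state change in the state machine of figure 3-1.

```python
from typing import Optional

def next_state(s: float, t: tuple) -> Optional[float]:
    """Transition function of the bank account; None means 'undefined'."""
    op, x, result = t
    if op == "deposit" and result == "okay":
        return s + x
    if op == "withdraw" and result == "okay":
        return s - x if s >= x else None
    if op == "withdraw" and result == "insufficient_funds":
        return s if s < x else None
    if op == "read_balance":
        return s if s == result else None   # the result must equal the state
    return None

def acceptable(transitions, initial: float = 0.0) -> bool:
    """True if the sequence is defined at every step, starting from 'initial'."""
    s = initial
    for t in transitions:
        s = next_state(s, t)
        if s is None:
            return False
    return True

# Committed transitions of figure 3-2(a), with a serialized before b.
# Action c aborted, so its deposit(20) does not appear.
print(acceptable([("deposit", 40, "okay"), ("read_balance", None, 60)]))  # False
print(acceptable([("deposit", 40, "okay"), ("read_balance", None, 40)]))  # True
```

Only a reply of 40 to read_balance would have made the rearranged sequence acceptable.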



3.1.3 Generating Atomic Behavior

To ensure atomic behavior, each of the results returned by an object must be valid. A
result is valid if the corresponding transition* causes a defined state change in the
state machine representing the serial specification of the object, given that the state
machine starts in a state defined by executing all the committed transitions serialized
before this transition. For example, in the previous bank account example, the result
60 is invalid because the state machine has a state of 40 after executing the
committed deposit.40.okay transition, and the state machine requires a
read.balance.x transition to have its result x equal to the current state. Notice that
when an object generates a result to an operation, it must ensure not only that the
result is valid, but also that all results returned to previously invoked operations
remain valid.

* Recall that a transition corresponds to a pair of invoke and return events.



3.1.4 Generating Valid Results

Obviously, in many cases we need some knowledge of the serialization order to 
generate valid results. For example, to return a valid result to a read_balance
operation invoked on a bank account object, we need to determine how the
read_balance operation is serialized with respect to previously invoked deposit and
withdraw operations. 

In addition to knowing the serialization order, we also need some knowledge of the 
outcomes of the operations that have been invoked. For example, knowing the 
serialization order between a read_balance operation and a deposit operation is not
enough to determine a valid result for read_balance; we also need to know the
outcome of the deposit operation if the read_balance operation is serialized after the
deposit operation. How the knowledge of a computation outcome is disseminated to 
the objects that the computation had accessed is determined by a commit protocol. 
We will discuss commit protocols in Chapter 5. 

In our conflict model, each object is viewed as possessing some knowledge of the
serialization order and the outcomes of the operations that have been invoked. An
object may not possess complete knowledge because some operations are still
tentative; they may be either aborted or committed later. In fact, a computation can
be finalized already but the objects that it has accessed will not have the knowledge
of its outcome until the outcome is propagated to these objects. In Chapter 5, we will
discuss how the serialization order is determined. In some algorithms, it is
predetermined and an object always has complete knowledge of the serialization order
among the operations that have been invoked. In some algorithms the order is
determined dynamically.

When determining whether a valid result can be returned while preserving the validity
of all previous results, an object must be prepared for all the possible combinations
of serialization orders and outcomes of the tentative operations that are consistent
with the local knowledge. Informally, a conflict is created when no result can be
returned such that it and all previously returned results will be valid under all
circumstances consistent with the local knowledge of the serialization order and
operation outcomes. For example, a read_balance operation invoked at a bank
account object may create a conflict because the object lacks the knowledge of the
serialization order between the read_balance operation and a previously invoked
deposit operation. The serialization order determines the valid balance to return and
there is no result that will be valid under all circumstances.

3.1.5 Conflicts 

A conflict may be created even when an object possesses complete knowledge of the 
serialization order and operation outcomes. For example, a deposit operation can 
create a conflict because the local knowledge dictates that the deposit operation is 
serialized before a previously invoked read_balance operation. Unless the deposit
operation is refused, the result returned to the read_balance operation may be
invalidated when the deposit operation is committed. 

On the other hand, suppose we have a bank account object with an initial balance of
$100 and the following history of events:

<withdraw(40), r, a> 

<okay, r, a> 

<commit, r, a>

<withdraw(30), r, b> 

<okay, r, b> 

No conflicts would be generated if a withdraw(20) operation were invoked on the 
account object, since an okay response to the withdraw operation is valid, and the 
okay responses to the previous withdraw operations are not invalidated, regardless
of the serialization order and outcomes of the operations. 

Notice that whether conflicts are created depends not just on operations that are
tentative or for which the serialization order with respect to the incoming operation is
unknown, but actually on the entire history of events. In the previous example,
conflicts would be created if action a had withdrawn more than $50, since whether
the incoming withdrawal can succeed would depend on the outcome of b.
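The $50 threshold in this example can be confirmed by brute force. The following Python sketch is an illustration only, under stated assumptions: the history contains nothing but withdrawals, "balance" is the balance left by committed operations, "tentative" lists the amounts of earlier withdrawals that replied okay but have not yet committed, and the function names are hypothetical. It enumerates every outcome and every serialization order of the tentative withdrawals together with the incoming one.

```python
from itertools import permutations, product

def okay_always_valid(balance, tentative, x):
    """True if replying okay to withdraw(x) is valid, and keeps the earlier okay
    replies valid, for every outcome and order of the uncommitted withdrawals."""
    ops = list(tentative) + [x]
    for outcomes in product([True, False], repeat=len(ops)):   # commit or abort
        committed = [amt for amt, keep in zip(ops, outcomes) if keep]
        for order in permutations(committed):
            s = balance
            for amt in order:
                if s < amt:        # withdraw.amt.okay is undefined in state s
                    return False
                s -= amt
    return True

def insufficient_always_valid(balance, tentative, x):
    """True if insufficient_funds is valid whichever subset of the tentative
    withdrawals commits ahead of withdraw(x)."""
    for outcomes in product([True, False], repeat=len(tentative)):
        s = balance - sum(amt for amt, keep in zip(tentative, outcomes) if keep)
        if s >= x:
            return False
    return True

def conflict(balance, tentative, x):
    """A conflict exists when neither reply is valid under all circumstances."""
    return not (okay_always_valid(balance, tentative, x)
                or insufficient_always_valid(balance, tentative, x))

print(conflict(100 - 40, [30], 20))  # False: a withdrew 40, okay is always valid
print(conflict(100 - 60, [30], 20))  # True: a withdrew 60, the reply depends on b
```

Exhaustive enumeration is used here only to make the definition concrete; it grows quickly with the number of tentative operations.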

Conflicts can also disappear with the execution of new actions not already in the 
history. Suppose action a in the example above had withdrawn more than $50 and a 
conflict is created when an action c invokes withdraw(20) at the account object. The 
conflict will disappear if another action d executes a deposit operation, commits, is 
serialized before c, and the amount deposited by d is large enough to cover the 
withdrawal by c. 

When a conflict is created, it can be resolved in several ways: 

- delay the operation generating the conflict, e.g., 2-phase locking [17];

- restart the computation generating the conflict, e.g., timestamp 
algorithm [48]; 

- make an assumption about the serialization order or operation outcomes 
and verify the assumption later, e.g., optimistic algorithms [26].

In this chapter, we will not elaborate on how conflicts are resolved. The appropriate
way to resolve a conflict is related to how the serialization order is determined. We
will discuss the subject in Chapter 5 when we discuss concurrency control
algorithms. Suffice it to say that resolving a conflict represents a potentially high 
cost. 

3.1.6 Conclusion 

In this section we have described how the requirement of generating only atomic
histories can be translated into the requirement of detecting conflicts. The conflict
conditions that can be derived from serial specifications are a useful indication of the
level of concurrency of our idealized implementation because they abstract away the
details of the concurrency control algorithm underneath. The conflict conditions are
a good approximation of an actual implementation's concurrency if the actual
implementation approximates closely the assumptions of our idealized
implementation. For example, for a long computation whose length is attributed to
communication delays, regarding the execution of an operation in the computation
as instantaneous is a close approximation to the actual execution. Our model of the
structure of the idealized implementation is also sufficiently general so that for any
implementation that conforms to this structure, the conflict conditions can be
regarded as an indication of the upper bound on an implementation's concurrency
level. Executing an operation non-instantaneously would only decrease
concurrency.

3.2 An Example 

In this section we will use the bank account object defined in figure 3-1 to show the 
following: 

1. How conflict conditions can be derived from a serial specification.

2. How the semantics of an application can be used to increase
concurrency over an implementation that uses, say, read/write locks and
2-phase locking.

3.2.1 Read_balance Operations

Consider when the operation read_balance is invoked on the bank account object r_i
defined in figure 3-1. Since the read.balance.x transition does not mutate the state
of r_i, the results returned to the previously invoked operations will remain valid
regardless of the outcome and the serialization order of read_balance. However,
read_balance itself returns a result whose validity depends on the serialization order
and outcomes of other operations.

Among the set of transitions, only deposit.x.okay and withdraw.x.okay change
the balance. Hence, a conflict is created if either of the following conditions is met:

1. there are deposit or successful withdraw operations (ones that had
returned okay) that are tentative and may be serialized before the
read_balance operation, or

2. there are committed deposit or successful withdraw operations that may
be serialized either before or after the read_balance operation.

In other words, the account object cannot return any number to the read_balance
operation that is guaranteed to be valid under all possible situations. Notice that we
have used the terms "may be serialized before/after" and "tentative" in the conflict
condition above. This reflects the view in our conflict model that an object possesses
some knowledge of the serialization order and operation outcomes. In the following
discussions, we will use the terms "potentially prior" and "potentially subsequent" as
abbreviations for "may be serialized before" and "may be serialized after"
respectively. The terms "definitely prior" and "definitely subsequent" are
abbreviations for "definitely serialized before" and "definitely serialized after"
respectively.

There is a remote possibility that some tentative deposit and withdraw operations
may cancel one another's effects, and because they are executed by the same action
or by sibling actions in the same computation, they are constrained to commit or
abort together. In those cases, no conflicts are created although there are tentative
deposit and withdraw operations. We will ignore such possibilities because it is
rather unlikely for a computation to deposit as well as withdraw from the same
account.

Suppose we have an implementation that uses a read/write lock on the balance such
that both deposit and withdraw would first acquire a read lock and then a write lock,
and read_balance would acquire a read lock only. For the read_balance operation,
there is no increase in concurrency with the use of the semantics of the account
object. The situations under which conflicts are created for this operation are exactly
the same in our idealized implementation and the implementation that uses a
read/write lock.
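The two conflict conditions for read_balance can be written down directly. The sketch below is an assumption-laden illustration rather than the implementation discussed later: the Update record, its field names, and the three-valued "order" field (the object's local knowledge of how an update is serialized relative to the read) are all hypothetical, and aborted updates are assumed to have been dropped from the list already.

```python
from dataclasses import dataclass

@dataclass
class Update:
    amount: float    # +x for a deposit, -x for a withdraw that returned okay
    committed: bool  # False means the update is still tentative
    order: str       # "prior", "subsequent", or "unknown" relative to the read

def read_balance(initial, updates):
    """Return the balance, or "conflict" if no reply is valid in every case."""
    for u in updates:
        potentially_prior = u.order in ("prior", "unknown")
        if not u.committed and potentially_prior:
            return "conflict"   # condition 1: tentative and potentially prior
        if u.committed and u.order == "unknown":
            return "conflict"   # condition 2: committed, order not yet known
    # Every remaining update is either committed and definitely prior,
    # or definitely subsequent and therefore invisible to this read.
    return initial + sum(u.amount for u in updates
                         if u.committed and u.order == "prior")

print(read_balance(0, [Update(40, True, "prior"), Update(20, False, "subsequent")]))  # 40
print(read_balance(0, [Update(40, True, "prior"), Update(20, False, "unknown")]))     # conflict
```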

3.2.2 Withdraw Operations 

The withdraw operations can illustrate how concurrency is increased with the use of
application semantics. Consider when the operation withdraw(x) is invoked at r_i.
The result of the operation is either okay or insufficient_funds, depending on whether
x exceeds the balance. Since an insufficient_funds reply does not imply a change
to the abstract state, no previous results returned will be invalidated. However,
because an insufficient_funds reply implies that the balance is less than x, the reply
can be returned only when the highest possible balance under the possible
combinations of serialization orders and outcomes of the operations that may be
serialized before the withdraw operation is less than x. This highest possible balance
can be calculated by adding all the unaborted and potentially prior deposits to the
initial balance and subtracting all the committed and definitely prior withdrawals.

Briefly, as long as the balance is so low that there would not be sufficient funds under
any circumstances, insufficient_funds can be returned, even if there may be tentative
update operations or update operations that may be serialized either before or after
the withdraw operation. Consequently, some conflicts that would be created had a
read/write semantics been imposed are avoided. Although this is not the most
significant improvement in concurrency over an implementation using read/write
locks, it does illustrate the use of the history of previous invocations, the current
operation's argument values and results, and the types of operations in determining
whether conflicts are created. This is in contrast to some other approaches that rely
only on the operation type and argument values to determine whether conflicts are
created [50].

A more significant improvement in concurrency happens when there is a large
balance. Again consider the withdraw operation but this time consider an okay reply.
Since an okay reply implies a decrement of the balance, the commitment of this
operation may invalidate the results of the following kinds of operations:

1. a potentially subsequent read_balance operation, or

2. a potentially subsequent and successful withdraw operation.*

To avoid creating any conflicts, there must be no operations of either kind if an okay
reply is to be returned. The number of conflicts can be further reduced if we
recognize that potentially subsequent withdraw(x') operations are permissible as
long as there is enough money to cover all the withdrawals. Or, more algorithmically,
when the lowest balance under the possible combinations of serialization orders and
outcomes of the operations potentially prior to the withdraw(x') operation (with this
operation included) is at least x'.

* The result of an unsuccessful withdraw operation will not be invalidated because the newly arrived
withdraw operation will never increase the balance.

Again, in addition to preserving the validity of the previous results, we must also make 
sure that the okay reply is valid before returning it to the withdraw operation.
Because an okay reply implies that the balance is at least x, it can be returned only 
when the lowest balance under the possible combinations of serialization orders and 
outcomes of the operations potentially prior to the withdraw operation is at least x. 

The discussion above shows that a withdraw operation will not create any conflicts
as long as the balance is either large enough to accept the withdrawal or small
enough to refuse the withdrawal, despite any uncertainty created by concurrent
updates. When compared to an implementation that uses read/write locks, it
represents a significant improvement in concurrency.
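The lowest and highest possible balances used in this discussion can be computed in one pass over the local history. The sketch below is illustrative only: the Op record and its fields are hypothetical, aborted operations are assumed to be omitted, and the reply function checks only the validity of the new reply itself, not whether earlier replies stay valid.

```python
from dataclasses import dataclass

@dataclass
class Op:
    kind: str        # "deposit" or "withdraw" (a withdraw that returned okay)
    amount: float
    committed: bool  # False means tentative
    order: str       # "prior", "subsequent", or "unknown" relative to the new withdraw

def balance_bounds(initial, ops):
    """Lowest and highest balance the new withdraw could observe."""
    lowest = highest = initial
    for op in ops:
        if op.order == "subsequent":
            continue                    # can never be serialized before the new withdraw
        certain = op.committed and op.order == "prior"
        if op.kind == "deposit":
            highest += op.amount        # any unaborted, potentially prior deposit
            if certain:
                lowest += op.amount     # only committed, definitely prior deposits
        else:
            lowest -= op.amount         # any unaborted, potentially prior withdrawal
            if certain:
                highest -= op.amount    # only committed, definitely prior withdrawals
    return lowest, highest

def reply(initial, ops, x):
    lowest, highest = balance_bounds(initial, ops)
    if lowest >= x:
        return "okay"
    if highest < x:
        return "insufficient_funds"
    return "conflict"

history = [Op("deposit", 100, True, "prior"),
           Op("withdraw", 30, False, "unknown"),
           Op("deposit", 50, False, "prior")]
print(balance_bounds(0, history))  # (70, 150)
print(reply(0, history, 60), reply(0, history, 200), reply(0, history, 100))
```

A conflict arises only for amounts that fall between the two bounds, which is why a very large or a very small balance avoids conflicts.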

The withdraw operation is representative of a large class of operations that can avoid
the creation of conflicts most, but not all, of the time. Whether a conflict is actually
created depends on the state of the object. The state of the object includes not only
what other concurrent operations are being executed, but also all previous
committed operations.

We will not discuss the conflicts that will be generated by a deposit operation, except
to note that because there is only one possible result (okay), which is defined for all
input states, this result is always valid. However, deposit may still create conflicts
because it mutates the state of the account and so it may affect the validity of other
results. In Chapter 4 we will discuss how this bank account object may be
implemented practically. Two different implementations are shown in figures 4-4 and
4-5.






3.3 Deriving Conflict Conditions 

In the previous section we illustrated, with the bank account object example, how
conflict conditions can be derived. In this section we will generalize from the bank
account example, and describe the process by which conflict conditions can be
derived from the serial specification of any abstract object. As will be seen in
Chapter 4, deriving these conflict conditions is an essential component of an actual
implementation.

In general, a conflict condition depends on the type of a transition. For example,
different conditions are required for a withdraw operation to reply with an okay or
insufficient_funds response. A conflict is created for an operation if every possible
transition of that operation creates a conflict. For each transition, the process of
deriving the conflict condition can be expressed conceptually as follows:

1. Based on how the abstract state is mutated by the transition, determine
the set of potentially subsequent operations in the history of the object
whose results may be invalidated. For a transition that only observes the
abstract state, such as a withdraw.x.insuf transition, the set is empty.
For a withdraw.x.okay transition, the set includes any potentially
subsequent read_balance operations and other successful withdraw
operations.

2. Derive the condition c1 under which the results of the set of operations
discussed in item 1, if the set is not empty, will remain valid with every
possible combination of serialization order and outcomes of their
potentially prior operations. For example, in order to return okay to a
withdraw(x) operation, there must not be any potentially subsequent
read_balance operations, and, if there are any potentially subsequent
successful withdraw(x') operations, the lowest balance under the
possible combinations of serialization order and outcomes of the
operations potentially prior to the withdraw(x') operation (with this
withdrawal included) should be at least x'.

3. Based on how the abstract state is mutated by other operations,
determine the set of potentially prior operations whose outcomes or
serialization order may affect the result of the transition. For example,
the set is empty for a deposit.x.okay transition because the deposit
operation can only return okay. For a withdraw.x.okay transition, the
set includes all deposit and successful withdraw operations that are
tentative and potentially prior to this transition, or that can be either prior
or subsequent to this transition.

4. Derive the condition c2 under which the result of this transition will
remain valid with every possible combination of serialization order and
outcomes of the set of operations discussed in item 3 (if the set is not
empty). For example, in order for an okay reply of a withdraw.x.okay
transition to be valid, the lowest balance under the possible
combinations of serialization order and outcomes of the operations
potentially prior to this transition should be at least x. This lowest
possible balance can be calculated by assuming that all the potentially
prior tentative deposit operations are either aborted or serialized after
this transition, and all the potentially prior and tentative successful
withdraw operations are committed and serialized before this transition.

5. The result of this transition can be generated without creating any
conflicts if the condition (c1 and c2) is satisfied.

The result of following the process above is a conflict condition, ¬(c1 and c2). The
conflict condition can be used as an indication of the concurrency that can be 
achieved with the particular functionality assumed in the process. 

The process described above can be simplified considerably when the concurrency
control algorithm is specified. For example, with a timestamp algorithm, there is only
one possible serialization order. It is not possible for an incoming operation to be
both potentially prior and subsequent to another operation.
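For the timestamp case just mentioned, conditions c1 and c2 for a withdraw.x.okay transition reduce to simple sums. The following Python sketch works under several assumptions that are not taken from the original text: every operation carries a distinct integer timestamp that fixes its place in the serialization order, the Op record and function names are hypothetical, aborted operations have been removed, and a read_balance that has already replied counts as an obstacle whether or not its computation has committed.

```python
from dataclasses import dataclass

@dataclass
class Op:
    kind: str        # "deposit", "withdraw" (returned okay), or "read"
    amount: float    # ignored for "read"
    ts: int          # timestamp, i.e. position in the serialization order
    committed: bool  # False means tentative

def lowest_balance_before(initial, ops, ts):
    """Lowest balance that an operation with timestamp ts could observe."""
    s = initial
    for op in ops:
        if op.ts >= ts:
            continue
        if op.kind == "deposit" and op.committed:
            s += op.amount          # tentative deposits may still abort
        elif op.kind == "withdraw":
            s -= op.amount          # tentative withdrawals may still commit
    return s

def can_reply_okay(initial, ops, x, ts):
    """Evaluate (c1 and c2) for an incoming withdraw(x) with timestamp ts."""
    later = [op for op in ops if op.ts > ts]
    # c1, part 1: no subsequent read_balance whose result would change
    if any(op.kind == "read" for op in later):
        return False
    # c1, part 2: every subsequent successful withdraw stays covered
    c1 = all(lowest_balance_before(initial, ops, w.ts) - x >= w.amount
             for w in later if w.kind == "withdraw")
    # c2: the okay reply itself is valid in the worst case
    c2 = lowest_balance_before(initial, ops, ts) >= x
    return c1 and c2

ops = [Op("deposit", 100, 1, True), Op("withdraw", 30, 5, False)]
print(can_reply_okay(0, ops, 50, 3))  # True: 100 - 50 still covers the later 30
print(can_reply_okay(0, ops, 80, 3))  # False: the later withdraw(30) would fail
```

When the function returns False, the conflict condition ¬(c1 and c2) holds for the okay transition; a complete treatment would also evaluate the insufficient_funds transition before declaring a conflict for the operation.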



3.4 Increasing Concurrency 

In the last two sections we described how conflict conditions can be derived based
on the serial specification of an object and how the semantics of an application can
be used to increase the concurrency of a system. By relaxing a serial specification,
or more precisely, by increasing the set of acceptable transition sequences, conflicts
become less likely to arise and concurrency is increased. The same idea has been
suggested by Liskov and Weihl in [33]. This section uses several examples to
illustrate this trade-off between functionality and concurrency. We hope to convince
the reader of the usefulness of the trade-off. In Chapter 6 we will take a more formal
approach to show the power of our atomicity definition. We will show that our
atomicity definition is at least as powerful as other correctness definitions [50, 38]
that had abandoned atomicity. The same gain in concurrency through the use of
these correctness definitions can be achieved through trading off functionality in our
atomicity definition.

There are several interesting classes of situations in which the semantics of an 
application can be changed to increase concurrency while the new semantics 
remains useful. The following list is not intended to be exhaustive, but rather serves 
to illustrate some interesting ways in which semantics can be changed. 

3.4.1 Reducing Precision of Numerical Results 

In one class, the precision of a numerical result is reduced to allow for more
concurrency. For example, a bank account object can provide an operation
lower_bound_balance that does not take any argument and returns a value that is a
lower bound for the balance. The following can be added to the state machine in
figure 3-1 on page 54.

T_i: <lower_bound_balance(), r_i, a><x, r_i, a> = lbalance.x
N_i(s, lbalance.x) = s if s ≥ x

By returning the lowest balance under all possible combinations of serialization
orders and operation outcomes, the result is valid yet never creates any conflicts.
Note that the increase in concurrency is "two-way." Not only does
lower_bound_balance never create a conflict, but a deposit operation invoked
afterwards will also avoid creating any conflict due to the possibility that it may be
serialized before the lower_bound_balance operation.* Although the result of
lower_bound_balance is not exact, it may be useful when the caller is using it as an
estimate.

* However, it is possible for a withdraw operation invoked afterwards to create a conflict due to the
lower_bound_balance operation.



Similar operations that increase the concurrency of the account object are
upper_bound_balance, balance_range (which returns the upper and lower bounds),
and approximate_balance, which takes a fraction as an argument and returns a value
guaranteed to be within a range of the balance determined by the fraction.

T_i: <approximate_balance(p), r_i, a><x, r_i, a> = abalance.p.x
N_i(s, abalance.p.x) = s if s*(1-p) ≤ x ≤ s*(1+p)

Consider another example in which an application is implementing a distributed
ticketing agent. A fixed number of tickets is divided among several sites for
availability reasons. Each site can sell tickets from its allotment. Occasionally, a
computation may be started by one of the sites to record the number of tickets left in
other sites and re-distribute the tickets. Suppose we regard each site as a ticket
account, supplying operations identical to those of the bank account defined in
figure 3-1. The "balance" of the account represents the number of tickets unsold in
the allotment in this site. Re-distributing the tickets would involve two phases: in the
first phase, read_balance operations are invoked at each of the sites; in the second
phase, based on the values returned by the read_balance operations, deposit and
withdraw operations will be invoked at the appropriate sites. The entire computation
can be aborted if one or more of the withdraw operations returns insufficient_funds
(more accurately, insufficient_tickets).

One of the problems of this implementation is that the semantics of the read_balance
operation may prove to be too restrictive. Tickets are prevented from being sold
while the re-distribution is proceeding because selling a ticket involves invoking
withdraw(1), which may create a conflict with a potentially subsequent read_balance
operation. Concurrency can be improved if the value returned by read_balance is
treated as a hint. Although the withdraw operations in a re-distribution computation
may find the actual number of tickets available for re-distribution is not the same as
that claimed in the hint, correctness is not compromised. A re-distribution
computation can always be aborted. In fact, the two phases of the re-distribution
computation can be separated into two computations. However, it may become
counterproductive if the hint loses too much of its accuracy. A more appropriate
strategy is to keep the two phases in the same computation but use the
approximate_balance operation in the first phase to record the tickets left in each site.
Approximate_balance allows other update operations to proceed concurrently. On
the other hand, it sets a limit on the imprecision of the result returned so that in most
cases tickets are re-distributed "reasonably."
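How much uncertainty approximate_balance can tolerate follows from its validity condition. The sketch below is a derivation under explicit assumptions, not something stated in the original: the range condition is read as non-strict, s*(1-p) ≤ x ≤ s*(1+p); the set of possible balances is summarized by its lowest and highest values with 0 ≤ lowest ≤ highest; and 0 ≤ p < 1. A single x is then valid for every possible balance exactly when highest*(1-p) ≤ lowest*(1+p). The function names and the choice of the midpoint are illustrative.

```python
def approximate_balance(lowest, highest, p):
    """Return an x valid for every balance in [lowest, highest], or None."""
    floor = highest * (1 - p)    # tightest lower limit, set by the largest balance
    ceiling = lowest * (1 + p)   # tightest upper limit, set by the smallest balance
    if floor > ceiling:
        return None              # no single reply fits every possible balance
    return (floor + ceiling) / 2

def lower_bound_balance(lowest, highest):
    """The lowest possible balance is always a valid lower bound."""
    return lowest

# A ticket site with between 95 and 105 unsold tickets, tolerance 10%:
print(approximate_balance(95, 105, 0.1) is not None)   # True
# Too much uncertainty for a 10% tolerance:
print(approximate_balance(50, 150, 0.1))               # None
```

In the ticketing example this means sales can continue during a re-distribution as long as the tentative sales keep the spread between the lowest and highest possible ticket counts within the tolerance p.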

3.4.2 Conditional Operations 

Another interesting class of situations in which the semantics of an application can
be relaxed to increase concurrency involves "conditional" operations. Consider a
change_meeting_place operation for the personal calendar object described in
section 1.2.1. The change_meeting_place operation takes two arguments, a unique
identifier of a meeting and a new place for the meeting. If it finds the meeting in the
calendar, it changes the place of the meeting and returns okay. Otherwise, it returns
no_such_meeting. A portion of an informal definition of the state machine defining the
serial specification for the calendar object is as follows:

T_i: <change_meeting_place(m, p), r_i, a><okay, r_i, a> = change_place.m.p.okay
     <change_meeting_place(m, p), r_i, a><no_such_meeting, r_i, a> = change_place.m.p.none

N_i(s, change_place.m.p.okay) = s' if s contains the meeting m and s' = s
                                   except that the place of m is changed to p
N_i(s, change_place.m.p.none) = s if s does not contain the meeting m

A global_change_meeting_place computation invokes a change_meeting_place
operation at each of the participants of a meeting. The problem with the semantics of
change_meeting_place is that if a global_change_meeting_place computation is started
before the corresponding set_up_meeting computation is committed, their operations
may arrive at different calendars in different orders and conflicts may be created.*

* The motivation for executing the set_up_meeting and global_change_meeting_place computations
concurrently is that at least those reachable participants can be informed of the place change as early
as possible. We assume that a participant can observe a tentative set_up_meeting computation using the
non-deterministic read_calendar operations described in Chapter 1.



The conflicts are created because the result of a change_meeting_place operation
depends on whether the meeting exists in the calendar. Restarts may be needed to
resolve these conflicts. The problem can be avoided with the semantics of
change_meeting_place modified to the following:

T_i: <change_meeting_place(m, p), r_i, a><okay, r_i, a> = change_place.m.p.okay
N_i(s, change_place.m.p.okay) = s' if s contains the meeting m and s' = s
                                   except the place of m is changed to p
                                 s if s does not contain the meeting m

The new semantics implies that change_meeting_place will change the meeting place
to the new place if the meeting is in the calendar and return okay. Otherwise, no
changes are made but okay is still returned.

With the new semantics, the mark and change_meeting_place operations from two
computations can be executed in different orders in different calendars. No conflicts
will be created. The only problem left is to make sure that set_up_meeting is serialized
before global_change_meeting_place. It can be accomplished with, for example, the
assignment of appropriate timestamps in a timestamp algorithm.

In this example, the change_meeting_place operation becomes "conditional"
because whether it makes any changes to the state depends on whether the meeting
exists. The reply okay does not indicate one way or another. A similar semantics can
be used for a cancel_meeting operation to reduce conflicts.
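The difference between the two specifications can be seen by comparing the replies they allow. In the Python sketch below, a calendar is modelled as a dictionary from meeting identifiers to places; this representation, the function names, and the sample room names are assumptions made for illustration only.

```python
def change_strict(calendar, m, p):
    """Original semantics: the reply reveals whether the meeting exists."""
    if m in calendar:
        return {**calendar, m: p}, "okay"
    return calendar, "no_such_meeting"

def change_relaxed(calendar, m, p):
    """Conditional semantics: okay is returned in either case."""
    if m in calendar:
        return {**calendar, m: p}, "okay"
    return calendar, "okay"

with_meeting = {"m1": "room 101"}   # set_up_meeting serialized first
without_meeting = {}                # set_up_meeting not serialized first

strict_replies = {change_strict(c, "m1", "room 202")[1]
                  for c in (with_meeting, without_meeting)}
relaxed_replies = {change_relaxed(c, "m1", "room 202")[1]
                   for c in (with_meeting, without_meeting)}

print(len(strict_replies))   # 2: the reply depends on the serialization order
print(len(relaxed_replies))  # 1: the same reply is valid under either order
```

Because the relaxed reply is the same in both states, the object can answer before it knows how the two computations are serialized; the resulting state must still be computed according to the serialization order.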

A similar but slightly different class of situations can be illustrated by the withdraw
operation in a bank account object. Originally, we have:

N_i(s, withdraw.x.insuf) = s if s < x

However, by changing the specification to:

N_i(s, withdraw.x.insuf) = s

a withdraw operation can return insufficient_funds whenever there is a possible
combination of serialization order and operation outcomes that would lead to
insufficient funds for the withdraw operation. Conflicts due to other withdraw or
deposit operations are minimized.






One can argue that a semantics similar to the more relaxed withdraw.x.insuf
transition above is necessary for a make_reservation operation in an airline
reservation object. The semantics is acceptable because most computations that
invoke make_reservation operations would probably not expect a reply of
insufficient_tickets to indicate that there are "absolutely" no tickets left. An airline
reservation object cannot afford to be blocked for other reservation operations
because a computation that had made a reservation is tentative. A computation may
last an arbitrarily long period of time, especially when some objects in the system are
unreliable. The application would rather turn away customers when it is not
absolutely sure that there is a ticket to be sold.*

3.4.3 Discussion 

A trade-off between precision and concurrency exists in all these examples.
Normally, if there are no communication problems and all computations are short, it
is probably not worthwhile to sacrifice the precision of the result in exchange for the
concurrency. However, concurrency becomes a much more serious concern in a
system with long atomic computations. The examples illustrate that there are many
interesting situations in which an application would be willing to exchange the
precision for the extra concurrency.

Our approach of relaxing the semantics of the application is not without problems.
For instance, an implementation of lower_bound_balance that always returns zero is a
correct implementation as zero is always a valid result. However, it is not very useful.
To eliminate this type of behavior, we need to impose additional constraints on the
implementation. In this particular example, we need to assert, in addition to the
requirements in the serial specification, that there must be a serial history sh
consistent with the local knowledge of the account object, such that the result
returned by lower_bound_balance is not smaller than the balance generated by
executing operations in the order of sh. In other words, an implementation should
only return x when x is a "possible" balance.

* The fact that airlines overbook their flights does not change our arguments since there is a limit on
how much overbooking is allowed.

Similarly, to eliminate uninteresting implementations that return insufficient_funds to
a withdraw(x) operation unnecessarily, we assert that there must be a serial history
sh consistent with the local knowledge of the account object, such that x is larger
than the balance generated by executing operations in the order of sh.
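The two usefulness constraints can be illustrated with a minimal Python sketch. It is
not the notation of this thesis, and it assumes a simplified account model that does
not appear in the text: a committed balance plus a list of tentative signed amounts,
where every combination of commits and aborts stands for one serial history
consistent with local knowledge. The effect of ordering on whether a withdrawal
succeeds is ignored, and the names possible_balances, useful_lower_bound and
may_refuse are hypothetical.

```python
from itertools import combinations

def possible_balances(committed, tentative):
    """Balances reachable under every commit/abort outcome of the tentative
    amounts (positive = deposit, negative = withdrawal). Assumed model."""
    balances = set()
    for r in range(len(tentative) + 1):
        for chosen in combinations(tentative, r):
            balances.add(committed + sum(chosen))
    return balances

def useful_lower_bound(result, committed, tentative):
    """The result is not smaller than the balance of at least one history."""
    return any(result >= b for b in possible_balances(committed, tentative))

def may_refuse(x, committed, tentative):
    """Refusing withdraw(x) requires x to exceed the balance of some history."""
    return any(x > b for b in possible_balances(committed, tentative))
```

With a committed balance of 100 and tentative amounts -30 and +50, an
implementation that answers zero fails the first check, while an answer of 70 passes
it.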

3.5 Summary 

In this chapter we described a conflict model, which allows conflict conditions to be
derived from the serial specification of an object. We argued that the conflict
conditions are useful indications of the concurrency level of an implementation of
that object due to the masking of the underlying concurrency control algorithms.
Based on the conflict conditions, a programmer can determine the appropriate
trade-off between the functionality and concurrency of an application.






Chapter Four 
Implementing Atomic Objects 



In the last chapter we focused on the functionality of abstract objects. We described
how the semantics expressed in the serial specifications of abstract objects can be
used to increase concurrency over an implementation that uses read/write locks and
2-phase locking. We discussed how functionality can be traded off for concurrency.
In this chapter we will describe how abstract objects can be implemented with a
concurrency level approximating that of the idealized implementation in the last
chapter.

Like the idealized implementation described in the last chapter, the implementations
described in this chapter are object-oriented. To guarantee that computations
execute atomically, we ensure that each of the abstract objects accessed by a
computation behaves atomically.^13 We call an object that behaves atomically an
atomic object. The advantage of an object-oriented implementation is its modularity.
When changes are made in the implementation of an atomic object, other program
modules are not affected as long as the serial specification of the object remains
unchanged.

^13 Recall that an object that behaves atomically guarantees that the committed events
involving itself, after being rearranged according to the serialization order, will be an
acceptable transition sequence according to the object's serial specification.

A simple way to implement atomic objects is to build them from smaller atomic
objects. For example, Argus [31] supports atomic records and atomic arrays. These
objects are equipped with read/write locks and follow a 2-phase locking protocol.
Their recoverability is implemented using some logging or shadow mechanisms.
Because these "system-level" atomic objects provide the necessary synchronization
and recovery, the implementation of abstract atomic objects on top of them can
ignore any concurrency or failures in the system. Unfortunately, as we have
illustrated in previous chapters, using these system-level atomic objects fails to take
advantage of the semantics of an application. The resulting concurrency level is too
low for a system with long computations. So in this chapter we will explore how
abstract atomic objects can be implemented from objects that do not mask the
underlying concurrency and failures.

There are three goals in this chapter. First, we will introduce programming
paradigms that allow abstract atomic objects to be constructed easily. These
paradigms should not only simplify application programming, but also help the
programmer to convince himself of the correctness of the implementations. The
simplicity of an implementation is an important consideration because subtle
programming errors can be introduced easily, especially when the complexity of an
implementation increases with the desire to increase concurrency.

Second, our implementations should maximize concurrency while maintaining 
reasonable performance in terms of the computing needed to execute an operation. 
The performance requirements of our implementations are not as stringent as in
some real-time applications. Long computations are less sensitive to increases in
execution time than short computations.

Third, the programming interface and programming paradigms used in this chapter
should make the underlying concurrency control algorithm transparent. Either a
timestamp algorithm or a locking algorithm, or maybe some other algorithm, could
be used to determine the serialization order and the actions to take when conflicts
arise. The motivation for this transparency is that a programmer can implement
atomic objects without having to learn different concurrency control algorithms.
Another motivation is that the programs written are portable when the underlying
concurrency control algorithm changes. Implicit in this goal is the belief that
different systems may use different concurrency control algorithms. We will justify
this belief in Chapter 5.

This chapter is structured in the following way. First, we present an overview of our
programming paradigms in section 4.1. For the next few sections (4.2 to 4.5) we
discuss individual aspects of our paradigms in more detail and provide motivation for
them. Section 4.6 presents some program examples illustrating our paradigms. To
illustrate that there is enough flexibility in our paradigms to optimize an
implementation, we discuss some of the trade-offs of different implementation
techniques in section 4.7.

4.1 Overview of Implementation Paradigms 

When the underlying concurrency and failures are not masked, two issues have to be
addressed: synchronization and recovery. The implementations described in this
chapter follow the structure of the idealized implementation in the last chapter
closely. To simplify synchronization and recovery, we divide them into two levels. At
the lower level, concurrent activities at an atomic object are executed such that they
appear to be instantaneous. Candidates for such activities are the processing of an
invocation request, or the processing of a message that conveys the outcome of a
computation. At the higher level, the execution of an atomic computation is viewed
as the execution of a collection of these instantaneous activities. Since the
collections of instantaneous activities of two atomic computations can interleave with
each other arbitrarily, synchronization is needed before processing a new invocation
request. An operation can only proceed when no conflicts are created. Recovery is
implemented by compensating activities when an object is informed of the abort of a
computation.

4.1.1 Lower-Level Synchronization and Recovery

In section 4.2, we will discuss the lower-level synchronization and recovery: how to
make the concurrent activities at an atomic object appear to be instantaneous. An
obvious solution is to apply the concept of atomicity again. Two kinds of atomic
computations can be used in an implementation. The first kind of atomic
computations are the ones that we have been discussing in previous chapters. They
invoke operations on atomic objects and can last a long time. The second kind of
atomic computations are used to make the concurrent activities at an atomic object
appear to be instantaneous to one another. They are usually much shorter because
these activities are usually small portions of an atomic computation of the first kind.

To distinguish the two kinds of atomicity, we call them global atomicity for the long 
atomic computations of the first kind, and local atomicity for the short atomic 
computations of the second kind. The serialization order that the locally atomic 
computations appear to be executing in bears no relationship to that of the globally 
atomic computations. A locally atomic computation can also be committed before
the long globally atomic computation that it is executed in is committed. Globally 
atomic computations appear to execute in a global serialization order and locally 
atomic computations in a local serialization order. 

Since our model of a computation is a sequence of operation invocations at various
objects, we are essentially implementing a long globally atomic computation with a
collection of short locally atomic computations. In Chapters 2 and 3 we described
how computations can be made atomic by accessing only atomic objects.
Corresponding to the two kinds of atomic computations are two kinds of atomic
objects: globally atomic objects and locally atomic objects. A different way to
understand our implementations is that we are implementing the globally atomic
objects with locally atomic ones.

An analogy can be drawn with the two levels of objects in System R [14]. In System R,
a page object is locally atomic in the sense that the page locks and recovery
mechanisms make the RSS actions (e.g., an operation on an index object, which is a
globally atomic object) appear to be atomic to one another. However, since page
locks are released at the end of an RSS action, a page object is not globally atomic
and a higher level of synchronization and recovery is needed on top of the lower level
of synchronization and recovery provided by the page locks and recovery
mechanisms.

4.1.2 Higher-Level Synchronization 

In section 4.3 we discuss how the higher-level synchronization is implemented.
Given that each operation on a globally atomic object is executed as a locally atomic
computation, there is still the task of determining whether a conflict is created with
each new operation invocation. In order to determine when conflicts are created,
each globally atomic object encodes a history of the operations invoked and the
results returned at that object. When a new invocation request arrives, the locally
atomic history object is examined to determine whether a conflict is created. If no
conflict is created, a result is returned and the transition^14 corresponding to the
operation and its result is added to the history object. Otherwise, a conflict is created
and must be resolved.

^14 Recall that a transition is a pair of invoke and return events.

A history object captures the transitions that have been executed at a globally atomic
object. The important operations of the history objects are operations to insert a
transition, delete a transition, enumerate transitions, and update the status of a
transition, which indicates whether the globally atomic computation invoking that
transition is committed or tentative. Each operation invoked on a globally atomic
object will insert a transition into the history object associated with the globally
atomic object. To prevent a history object from growing indefinitely, committed
transitions are deleted periodically and "merged" into a more compact
representation. When transitions are enumerated from a history object, they can be
filtered by their status or the type of operation and results. The caller of the
enumerate operation can also supply another transition t and a condition c (e.g.,
"potentially subsequent according to the global serialization order") such that only
transitions that satisfy c with respect to t will be returned. The words "potentially"
and "definitely" capture the local knowledge on the global serialization order. With
the use of the history objects and other locally atomic objects, an operation can
determine what other operations have been executed at the globally atomic object
and the possible combinations of serialization orders and operation outcomes. We
will describe the implementation of these history objects in more detail in section
4.3 and Chapter 5.
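The four basic history operations can be pictured with a small Python sketch. It is
only an illustration under stated assumptions, not the interface defined in this
thesis: the Transition fields, the integer base used as the compact representation,
and the account-style merge are all hypothetical, and filtering by serialization order
(the condition c) is left out.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    computation: str            # id of the invoking globally atomic computation
    operation: str              # e.g. "deposit" or "withdraw"
    argument: int
    result: str                 # e.g. "ok"
    status: str = "tentative"   # "tentative" or "committed"

class History:
    def __init__(self, base=0):
        self.base = base        # compact summary of merged committed transitions
        self.transitions = []

    def insert(self, t):
        self.transitions.append(t)

    def delete(self, t):
        self.transitions.remove(t)

    def update_status(self, computation, status):
        for t in self.transitions:
            if t.computation == computation:
                t.status = status

    def enumerate(self, status=None, operation=None):
        """Yield transitions, optionally filtered by status and operation."""
        for t in self.transitions:
            if status is not None and t.status != status:
                continue
            if operation is not None and t.operation != operation:
                continue
            yield t

    def merge_committed(self):
        """Fold committed transitions into base and drop them."""
        for t in list(self.enumerate(status="committed")):
            if t.operation == "deposit":
                self.base += t.argument
            elif t.operation == "withdraw" and t.result == "ok":
                self.base -= t.argument
            self.transitions.remove(t)
```

In this sketch the history only grows with tentative work: once a computation
commits, its transitions can be merged into the summary and discarded.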

4.1.3 Higher-Level Recovery 

In section 4.4 we discuss how the higher-level recovery is implemented. When locally 
atomic objects are used to implement globally atomic objects, locally atomic 
computations are committed before the corresponding globally atomic computation 
is completed. The effects of the operations invoked on the locally atomic objects 
have to be explicitly undone when the globally atomic computation is aborted. The 
combined effects of the original operations and the compensating operations should 
make the globally atomic objects appear to be failure atomic. 

We introduce two recovery paradigms in section 4.4. These paradigms are stylized
approaches to performing recovery for globally atomic objects implemented with
locally atomic objects. Their goal is to simplify the writing of application-dependent
recovery code. Simpler code makes it easier to convince ourselves that an
implementation is correct.

In the first paradigm, only one mutator operation is performed on locally atomic
objects during an operation on a globally atomic object: inserting a transition into
the locally atomic history object. When the globally atomic computation containing
the operation is committed, the transition can be used to determine other mutator
operations to be performed on other locally atomic objects in the representation of
the globally atomic object. This type of processing after the globally atomic
computation is committed is called commit processing. In that sense, the history
object serves as an intentions list. When the globally atomic computation is aborted,
the only compensating activity needed is to delete the transition from the history
object.

In the second paradigm, arbitrary operations can be invoked on the locally atomic
objects. Here, the goal is to minimize the work that needs to be done when the
globally atomic computation is committed. An undo operation is associated with
each of the tentative operations on a globally atomic object. The undo operation is
invoked when the tentative operation is aborted. The undo operation invokes
compensating operations on the underlying locally atomic objects. The history
object is a natural place to store names of the undo operations and their arguments.
In this case, the history object serves as an undo log.
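The contrast between the two recovery paradigms can be sketched in Python for a
deposit-only account. This is an illustration under simplifying assumptions, not the
implementation developed in section 4.4: synchronization is omitted, and the class
and method names are hypothetical.

```python
class IntentionsAccount:
    """First paradigm: the history acts as an intentions list."""
    def __init__(self, balance=0):
        self.balance = balance   # changed only during commit processing
        self.history = []        # (computation, amount) transitions

    def deposit(self, computation, amount):
        self.history.append((computation, amount))   # the only mutation

    def commit(self, computation):
        # Commit processing: apply the recorded intentions, then drop them.
        for c, amount in self.history:
            if c == computation:
                self.balance += amount
        self.history = [h for h in self.history if h[0] != computation]

    def abort(self, computation):
        # The only compensation needed is deleting the transitions.
        self.history = [h for h in self.history if h[0] != computation]


class UndoLogAccount:
    """Second paradigm: the history acts as an undo log."""
    def __init__(self, balance=0):
        self.balance = balance
        self.history = []        # (computation, undo operation, argument)

    def _undo_deposit(self, amount):
        self.balance -= amount

    def deposit(self, computation, amount):
        self.balance += amount   # update in place
        self.history.append((computation, self._undo_deposit, amount))

    def commit(self, computation):
        # Little work at commit: just discard the undo records.
        self.history = [h for h in self.history if h[0] != computation]

    def abort(self, computation):
        # Invoke the compensating operations in reverse order.
        for c, undo, arg in reversed(self.history):
            if c == computation:
                undo(arg)
        self.history = [h for h in self.history if h[0] != computation]
```

The first class does its work at commit time and almost none at abort; the second
reverses that balance.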

4.2 Global Atomicity and Local Atomicity 

The separation of synchronization and recovery into two levels allows division of
labor and greatly simplifies the task of programming application-dependent
synchronization and recovery. By limiting the higher-level synchronization to happen
at operation boundaries, each globally atomic computation will observe only a limited
set of well-defined intermediate states of another computation. Similarly, higher-level
recovery is simplified because the compensating activities, which can be executed as
locally atomic computations, start with a limited set of well-defined intermediate
states.

This section describes the idea of having two kinds of atomic objects in more detail.
We will describe how locally atomic objects can be implemented and compare our
paradigm of implementing globally atomic objects using locally atomic objects with
related work.

4.2.1 Definitions of Global Atomicity and Local Atomicity 

With the introduction of the distinction between global atomicity and local atomicity,
we have separated the objects in a system into globally atomic objects and locally
atomic objects. Recall that in Chapter 2 we have defined a history to be atomic if it is
equivalent to an acceptable serial history. We have defined a serial history sh to be
acceptable if, by partitioning sh according to the object with which an event is
associated, each of the sub-histories is an acceptable transition sequence according
to the serial specification of the object associated with that sub-history. The same
definition can be used to define local atomicity if we limit our attention to locally
atomic objects.

A history is globally atomic if it is equivalent to an acceptable globally serial history.
A globally serial history is a history in which the events are rearranged according to a
linearization of the globally atomic computations. A globally serial history sh is
acceptable if, by partitioning sh according to the globally atomic object with which
an event is associated, each of the sub-histories is an acceptable transition sequence
according to the serial specification of the globally atomic object associated with that
sub-history. Local atomicity can be defined analogously using the concept of a
locally serial history in which events are rearranged according to a linearization of
the locally atomic computations. Notice that there is a local serialization order and a
global serialization order. The behavior of the locally atomic objects is not
necessarily valid according to the global serialization order.
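The acceptability test in this definition (partition a serial history by object, then
check each sub-history against that object's serial specification) can be sketched as
follows. The event encoding and the account specification are assumptions made
for illustration; they are not taken from Chapter 2.

```python
def acceptable(serial_history, specs):
    """serial_history: list of (object_name, operation, argument, result).
    specs: maps an object name to a function that checks one object's
    transition sequence against its serial specification."""
    sub_histories = {}
    for obj, op, arg, result in serial_history:
        sub_histories.setdefault(obj, []).append((op, arg, result))
    return all(specs[obj](seq) for obj, seq in sub_histories.items())

def account_spec(seq, balance=0):
    """Serial specification of a simple account (assumed for illustration)."""
    for op, arg, result in seq:
        if op == "deposit" and result == "ok":
            balance += arg
        elif op == "withdraw" and result == "ok" and arg <= balance:
            balance -= arg
        elif op == "withdraw" and result == "insufficient_funds" and arg > balance:
            pass
        else:
            return False    # this transition is not allowed in this state
    return True
```

Whether the check defines global or local atomicity depends only on which
computations were linearized to produce the serial history passed in.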

4.2.2 Implementing Locally Atomic Computations 

We assume that a programmer can declare the boundaries of locally and globally
atomic computations. An access to a globally (locally) atomic object should always
be enclosed in a globally (locally) atomic computation. Typically, a locally atomic
computation is a small portion of a globally atomic computation and should last only
a short time (e.g., executing an operation on a globally atomic object). A globally
atomic computation can contain several locally atomic computations. A locally
atomic computation is committed when it terminates successfully. The locally atomic
computation remains committed even though the globally atomic computation that
contains it may be aborted later. Notice that given the same serial specification, the
concurrency of a locally atomic object is potentially much higher than that of a
globally atomic object because of the shorter locally atomic computations. If a locking
algorithm is used to implement an atomic object, a shorter computation allows
locks to be released sooner.

Locally atomic objects can be implemented using a traditional concurrency control
algorithm and recovery mechanisms based on read/write semantics [17]. With such
an implementation, it is in general inappropriate to access a locally atomic object in a
long locally atomic computation. The concurrency of the implementations described
in this chapter depends on the use of short locally atomic computations.
Alternatively, the same implementation paradigm described in this chapter can be
used to implement the locally atomic objects as well as the globally atomic objects. A
multiple-level atomicity model can be extended easily from the current dichotomy of
global atomicity and local atomicity. In section 4.7.2 we will explore the situations in
which the generality of multiple-level atomicity is needed.

In order for the effects of an atomic computation to remain permanent, the updates
made by the computation have to be stored into stable memory when the
computation commits. Afterwards, the updates will survive site crashes. If accessing
stable memory is expensive, the cost of implementing each operation on a globally
atomic object with a locally atomic computation may become prohibitive.

To avoid the cost of accessing stable memory every time a locally atomic
computation is committed, we can make use of the fact that locally atomic
computations are not invoked directly by human users. Consequently, the changes
made by a locally atomic computation a do not have to be stored in stable memory
until the globally atomic computation that contains a commits, or until other locally
atomic computations store their changes in stable memory. The latter condition is
needed because other locally atomic computations may have observed the effects of
a. Since these other locally atomic computations can also delay their access to the
stable memory, all the accesses due to the commitments of locally atomic
computations can be piggybacked on a single access when some globally atomic
computation commits. The details of how a distributed computation coordinates its
accesses to stable memory in different sites will be discussed in section 5.4.
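The piggybacking idea can be sketched for a single site in Python. The class below
is a hypothetical illustration, not the mechanism of section 5.4: a list stands in for
stable memory, and every pending change is flushed together, which conservatively
covers changes that other locally atomic computations may have observed.

```python
class DeferredStableLog:
    def __init__(self):
        self.volatile = []       # changes of committed local computations
        self.stable = []         # stands in for stable memory
        self.stable_writes = 0   # number of stable-memory accesses

    def local_commit(self, change):
        # A locally atomic computation commits without touching stable memory.
        self.volatile.append(change)

    def global_commit(self, computation):
        # One access carries all pending changes, in the order they were made,
        # followed by the commit record of the globally atomic computation.
        self.stable.extend(self.volatile)
        self.stable.append(("commit", computation))
        self.volatile.clear()
        self.stable_writes += 1
```

Three local commits followed by one global commit cost a single stable-memory
access in this sketch instead of four.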

4.2.3 Related Work 

The same idea of having multiple levels of atomicity has been suggested by Beeri et
al. in [5] and Moss et al. in [42]. The difference between our work and theirs lies in
how synchronization and recovery are performed. To implement serializability, Moss
proposes a conflict-based locking mechanism: locks on the level i-1 objects are
released when the level i operation that accessed them is finished. However, a lock
at level i is retained so that conflicting level i operations are delayed. In [42] the
conditions under which "simple aborts" exist are also derived: recovery of a level i
object can be achieved by simply omitting the effects of the operations on the level
i-1 objects. The conditions require that no conflicting level i-1 operations have been
executed by other level i operations.

Weihl [55] describes how atomic objects can be built with other smaller atomic
objects and mutex objects. Mutex objects behave like monitors [21]. Programs can
acquire and release mutex objects to achieve mutual exclusion. The activities
performed while the mutex lock is held are serialized as a result. The mutex objects
can be viewed as a simple way to implement local atomicity.
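As a rough analogy only, the serializing effect of a mutex can be shown with
Python's threading.Lock. This is not the mutex construct of [55]: a plain lock
provides mutual exclusion but none of the recovery that local atomicity also
requires. The MutexCounter class is hypothetical.

```python
import threading

class MutexCounter:
    def __init__(self):
        self._mutex = threading.Lock()
        self.value = 0

    def increment(self):
        # The read-modify-write below runs while the mutex is held, so the
        # activities of different threads are serialized.
        with self._mutex:
            current = self.value
            self.value = current + 1

def run(counter, n):
    for _ in range(n):
        counter.increment()

counter = MutexCounter()
threads = [threading.Thread(target=run, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all four threads finish, no increment has been lost, because each one
appeared instantaneous to the others.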

4.3 Synchronization 

Since locally atomic computations might not be serialized in the global serialization
order, a higher level of synchronization is needed so that the behavior of a globally
atomic object appears to be globally atomic. Our approach to the higher-level
synchronization is to capture sufficient information of the history of events generated
at a globally atomic object using history objects, so that it can be used to determine
whether conflicts are created. Since all the relevant local information in our atomicity
definition is being captured by these history objects, our approach is "complete" in
the sense that unnecessary conflicts need not be created except when there is a lack
of global or future knowledge.

In [55] the state of an atomic object is encoded with smaller atomic objects and
monitor-like mutex objects. By encoding enough information in these objects, an
incoming operation can determine whether conflicts are created and delaying is
necessary. It is up to each application to determine how the state is to be encoded.
To simplify programming, we have provided a more stylized approach of using history
objects for the same purpose.

In general we can expect our programs that handle an invocation request to follow
the following pattern:

if condition1 then ...; insert t1 into history; return result1 end
if condition2 then ...; insert t2 into history; return result2 end
...
if conditionN then ...; insert tN into history; return resultN end
resolve conflict

In the expressions condition_i the program determines whether certain transitions
can be generated without creating any conflicts. If an operation can proceed
immediately, a valid result is determined from the history object and other objects. A
transition for this operation can be inserted into the history object to be examined by
later operations before returning the result. If none of the condition_i's are satisfied,
a conflict is created and must be resolved. In the rest of this section we will first
discuss the operations provided by a history object that play an important role in the
programming of the condition_i expressions. Then we will discuss the resolve
conflict statement.
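The pattern can be made concrete with a Python sketch of a withdraw handler. It
rests on assumptions that are not from the text: the history holds only tentative
transitions as (operation, amount, result) tuples, the new operation is serialized
after all of them, and the two conditions are simple worst-case and best-case
balance tests. The names handle_withdraw and Conflict are hypothetical.

```python
class Conflict(Exception):
    """No condition held; the conflict still has to be resolved."""

def handle_withdraw(history, committed_balance, x):
    pending_out = sum(a for op, a, r in history
                      if op == "withdraw" and r == "ok")
    pending_in = sum(a for op, a, r in history if op == "deposit")
    # Lowest balance: tentative withdrawals commit, tentative deposits abort.
    lowest = committed_balance - pending_out
    # Highest balance: tentative deposits commit, tentative withdrawals abort.
    highest = committed_balance + pending_in
    if x <= lowest:                      # condition1: succeeds in every case
        history.append(("withdraw", x, "ok"))
        return "ok"
    if x > highest:                      # condition2: fails in every case
        history.append(("withdraw", x, "insufficient_funds"))
        return "insufficient_funds"
    # Neither condition holds: the result depends on unknown outcomes.
    raise Conflict("outcome depends on tentative transitions")
```

Each branch inserts its transition before returning, so later operations can see it;
falling through corresponds to reaching the resolve conflict statement.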

4.3.1 History Objects 

Figure 4-1 describes an informal specification of the interface of a history object. To
avoid a lengthy digression describing all the operations supported by a history
object, figure 4-1 is only a partial list and presents only the operations relevant to our
approach of higher-level synchronization. We will continue our description of history
operations in section 4.5.1.






A history object can be pictured as a tree of transition objects. These transition
objects correspond to the different types of transitions in a serial specification. The
order of the transitions in the tree is determined by their global serialization order. A
tree instead of a linear list is used because a history object may not have complete
knowledge of the serialization order. We will not discuss the transition objects in this
section. An informal specification of their interface will be presented in section 4.5.2.

p_sub = procedure(h: history, t: transition) returns(history)
% returns the largest sub-history of h in which all the transitions
% are potentially subsequent to t.

p_prior = procedure(h: history, t: transition) returns(history)
% returns the largest sub-history of h in which all the transitions
% are potentially prior to t.

d_sub = procedure(h: history, t: transition) returns(history)
% returns a sub-history of h in which all the transitions are
% definitely subsequent to t.

d_prior = procedure(h: history, t: transition) returns(history)
% returns a sub-history of h in which all the transitions are
% definitely prior to t.

p_between = procedure(h: history, t1, t2: transition) returns(history)
% returns a sub-history of h in which all the transitions are
% potentially subsequent to t1 and potentially prior to t2.

d_between = procedure(h: history, t1, t2: transition) returns(history)
% returns a sub-history of h in which all the transitions are
% definitely subsequent to t1 and definitely prior to t2.

Figure 4-1: Interface of a History Object
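One way to picture these six operations is the Python sketch below. It is an
assumption-laden illustration, not the implementation discussed in Chapter 5: local
knowledge of the serialization order is kept as a set of "definitely before" pairs,
transitions are plain strings, results are lists rather than history objects, and the
argument transition itself is excluded from every result.

```python
class History:
    def __init__(self):
        self.transitions = []
        self.before = set()   # (a, b): a is definitely serialized before b

    def insert(self, t):
        self.transitions.append(t)

    def order(self, a, b):
        """Record that a is definitely before b, keeping the set transitive."""
        self.before.add((a, b))
        changed = True
        while changed:
            changed = False
            for p, q in list(self.before):
                for r, s in list(self.before):
                    if q == r and (p, s) not in self.before:
                        self.before.add((p, s))
                        changed = True

    def d_prior(self, t):
        return [u for u in self.transitions if (u, t) in self.before]

    def d_sub(self, t):
        return [u for u in self.transitions if (t, u) in self.before]

    def p_prior(self, t):
        # Potentially prior: not known to be definitely subsequent.
        return [u for u in self.transitions
                if u != t and (t, u) not in self.before]

    def p_sub(self, t):
        # Potentially subsequent: not known to be definitely prior.
        return [u for u in self.transitions
                if u != t and (u, t) not in self.before]

    def p_between(self, t1, t2):
        later = self.p_sub(t1)
        return [u for u in self.p_prior(t2) if u in later]

    def d_between(self, t1, t2):
        later = self.d_sub(t1)
        return [u for u in self.d_prior(t2) if u in later]
```

A transition whose order relative to t is unknown shows up in both p_prior and
p_sub, which is exactly the case a tree-shaped history has to represent.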



4.3.1.1 Masking Concurrency Control Algorithms

In order to mask the concurrency control algorithm used underneath the
programming interface, we follow the conflict model described in Chapter 3. We
assume that each history object has some knowledge of operation outcomes and the
global serialization order determined by the concurrency control algorithm
underneath. The operations p_sub, d_sub, p_prior, d_prior, p_between, and d_between
supported by a history object reflect this view. For example, p_sub takes a history
object as its first argument and a transition object as its second argument, and
returns the sub-history in which the transitions are serialized potentially after the
argument transition. How the transitions in the sub-history are ordered is again
determined by the global serialization order. The uncertainty about operation
outcomes can be reflected with an attribute on the transition objects, which are
either committed or tentative. Aborted transitions will be deleted from a history
object. We will describe the use of this attribute in sections 4.5.2 and 4.5.3.

These p_ and d_ operations can be implemented and optimized rather
straightforwardly given the underlying concurrency control algorithm. For example,
if the global serialization order is determined by timestamps assigned at the
beginning of a globally atomic computation, the tree in which the transition objects
are arranged degenerates to an ordered list, since the global serialization order is
known when an operation on a globally atomic object is invoked. There is also no
difference between the d_ and p_ operations. A different implementation is required
for a concurrency control algorithm in which the global serialization order is
determined in a way similar to 2-phase locking. We will defer our discussion of
concurrency control algorithms and how these history operations can be
implemented until Chapter 5. Note that an implementation for these operations does
not necessarily have to copy the history object. A lazy evaluation scheme can be
used to enumerate the transitions in the returned history object without changing the
semantics of the operations.
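For the timestamp case, a minimal Python sketch follows. It assumes, for
illustration only, that the operations are keyed by the timestamp of the argument
transition's computation rather than by the transition itself; TimestampHistory is a
hypothetical name. Generators give the lazy enumeration: no sub-history is copied.

```python
import bisect

class TimestampHistory:
    def __init__(self):
        self._stamps = []        # timestamps, kept sorted
        self._transitions = []   # transitions, in the same order

    def insert(self, timestamp, transition):
        i = bisect.bisect_right(self._stamps, timestamp)
        self._stamps.insert(i, timestamp)
        self._transitions.insert(i, transition)

    def sub(self, timestamp):
        """Lazily yield the transitions serialized after timestamp."""
        i = bisect.bisect_right(self._stamps, timestamp)
        for j in range(i, len(self._transitions)):
            yield self._transitions[j]

    def prior(self, timestamp):
        """Lazily yield the transitions serialized before timestamp."""
        i = bisect.bisect_left(self._stamps, timestamp)
        for j in range(i):
            yield self._transitions[j]

    # The order is fully known, so "potentially" and "definitely" coincide.
    p_sub = d_sub = sub
    p_prior = d_prior = prior
```

The ordered list replaces the tree, and a binary search finds the boundary of each
sub-history.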

With the p_ and d_ operations to capture the global serialization order relationship
among the transitions in a history object, the concurrency control algorithm used by
the system becomes transparent to the application programmers. Although
application-dependent synchronization is needed in an implementation, the
programmer does not have to be aware of the choice of concurrency control
algorithm made by the system. This transparency is the primary characteristic that
distinguishes our proposal from all previous ones that involve application-dependent
synchronization.






4.3.1.2 Advantages and Disadvantages of Transparency

There are both advantages and disadvantages of providing this transparency. There
are two advantages. First, programmers do not have to understand the details of
different concurrency control algorithms. The same conflict model can be used
during programming. Second, the application programs remain correct even when
the underlying concurrency control algorithm is changed. No program modification
is needed. One may question how often a concurrency control algorithm would
change underneath the application programs. A situation in which this may happen
is when application programs are ported, especially for "common" abstract objects
such as a FIFO-queue, a set, or some kind of table. Another possibility is for a system
to change its concurrency control algorithm in order to combine with another system,
so that computations that span both systems can be executed.

One of the disadvantages of the transparency is its over-generality. Application
programs become more difficult to write than necessary. For example, given a
timestamp concurrency control algorithm, the serialization order is always known.
The difference between p_ and d_ disappears. Furthermore, the programmer does not
need to consider cases where a transition is both potentially subsequent and
potentially prior to another transition.

Another possible disadvantage of the transparency is decreased efficiency. An
application program may require several passes over a general history object to
determine whether a conflict is created. On the other hand, because of the simpler
structure of a history object when the concurrency control algorithm is known,
one-pass versions can be constructed more easily than when the concurrency control
algorithm is transparent.

Whether these disadvantages outweigh the advantages cannot be evaluated without
more experience implementing abstract atomic objects. On the other hand, it seems
that without actual experience of the performance of different concurrency control
algorithms, a safer investment would be to emphasize portability.






4.3.2 Resolving Conflicts 

When an object decides that a conflict has been created, it must resolve the conflict. 
Depending on the concurrency control algorithm, and why the conflict arises, the
range of actions that may be taken includes delaying the current operation, restarting
the current computation, or making an assumption about the serialization order or
some other transition's outcome.

Ideally, a programming interface can provide a generic resolve conflict statement
which maintains the transparency of the concurrency control algorithm underneath.
An intelligent compiler or run-time system can generate code to determine the
actions to take, such as when to reschedule a request if delay is needed, or whether
to restart a computation or delay a request, or what assumptions to make. However,
supporting such a generic statement efficiently is difficult as conflict conditions can
be arbitrary expressions.

Depending on the concurrency control algorithm, simple-minded solutions can be
devised. For example, in some algorithms a request would be able to proceed given
that sufficient time has passed. In those algorithms, a simple solution is to
reschedule an invocation request periodically. For some other algorithms, in which
restarts and delays are the only two possible alternatives to resolve a conflict, the
more pessimistic restart can be chosen whenever delays do not guarantee eventual
progress. For algorithms that make assumptions on operation outcomes and the
serialization order, different assumptions can be tried to determine whether they can
maintain a valid behavior given that those assumptions are correct.

The drawback of these simple-minded solutions is the loss of concurrency in the
form of unnecessary delays, spurious reschedules, unnecessary restarts, or
unnecessary assumptions. To provide a compromise between this loss of
concurrency and a complicated programming interface, we replace the resolve
conflict statement with a retry statement and require the programmers to specify
a proceed condition with a retry statement. The purpose of the proceed conditions
is to provide a hint to the language system as to when conflicts would not be created.
The structure of the proceed conditions is required to obey a well-formedness
requirement described below so that a proceed condition can be analyzed by the
language system. Based on the proceed conditions, the language system can
determine whether a delay would lead to eventual progress, when to reschedule, or
what assumptions to make.

A proceed condition is taken as a hint to the condition under which an invocation 
request would be able to proceed. However, in order to guarantee that a request is 
not delayed indefinitely, a proceed condition should be well-formed. A well-formed 
proceed condition satisfies the following requirements: 

1. The proceed condition should be satisfied if:

   a. new operations are not started, and

   b. all current operations in the system, except the one being
      considered, are finalized and the outcomes are known by all
      history objects, and

   c. the operation being considered is serialized after all existing
      transitions and the global serialization order among existing
      transitions is known.

2. It is not satisfied currently.

3. It is constructed with boolean operations and the operations provided by
   the history objects.

The first two requirements guarantee that by analyzing a proceed condition, a
language implementation can discover the set of "events" that may cause the
proceed condition to become satisfied. In some concurrency control algorithms,
these events may correspond to the finalization of incomplete computations. In some
algorithms, the events may involve a restart of the computation that invokes the
current operation. The first requirement prevents situations in which the proceed
condition is too restrictive. If a proceed condition is too restrictive, the current
operation may never be rescheduled, or unnecessary restarts may be initiated.
Application programmers should expect the language implementation to make better
decisions in determining how to resolve a conflict if the proceed condition is a closer
approximation of the negation of the conflict condition. The second requirement
prevents situations in which the proceed condition is already satisfied. If the proceed
condition is already satisfied, the language implementation may not be able to
determine the set of events that can cause the current operation to resolve the
conflict. In that case, the only alternative is busy-waiting in the form of constantly
rescheduling or constantly restarting. It is probably not the most desirable solution.
The third requirement restricts the structure of a proceed condition so that it can be
analyzed by the language implementation. In Chapter 5 we will describe how a
language implementation can use the proceed conditions to determine the actions
that need to be taken to resolve a conflict.

The retry statements are also paired with begin entry statements so that a
program that uses the retry statement has the following form:

    ...
    begin entry
        if condition1 ... end
        ...
        if conditionN ... end
    retry whenever c    % c is a proceed condition

The semantics of the retry statement is to abort any work performed in the last
retry loop and retry from the matching begin entry statement. A retry might be
attempted after a certain delay, a computation restart, or the noting of some
assumptions. The proceed condition c may or may not be satisfied when the loop
is retried.
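
The control flow of a begin entry ... retry whenever c block can be sketched in
Python as follows. This is only an illustration under stated assumptions: the
names run_entry, clauses, proceed_condition, wait_for_change, and
abort_local_work are hypothetical, and a real language system would choose among
delaying, restarting, or recording assumptions instead of calling a single wait
hook.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: each clause is a (guard, body) pair that
# mirrors one "if condition_i ... end" inside a begin entry block.
Clause = Tuple[Callable[[], bool], Callable[[], object]]


def run_entry(clauses: List[Clause],
              proceed_condition: Callable[[], bool],
              wait_for_change: Callable[[], None],
              abort_local_work: Callable[[], None],
              max_attempts: int = 100) -> object:
    """Run the clauses; if none returns, abort the pass and retry."""
    for _ in range(max_attempts):
        for guard, body in clauses:
            if guard():
                return body()      # the operation proceeds and returns
        # No clause applied, so a conflict exists: undo this pass.
        abort_local_work()
        # The proceed condition is only a hint about when to try again.
        # It may or may not hold when the loop is re-entered.
        if not proceed_condition():
            wait_for_change()
    raise RuntimeError("gave up after max_attempts retries")


# Usage: the operation is blocked until a (simulated) conflicting
# computation finishes, which happens during the first wait.
state = {"blocker_done": False, "aborts": 0}


def finish_blocker() -> None:
    state["blocker_done"] = True


def count_abort() -> None:
    state["aborts"] += 1


result = run_entry(
    clauses=[(lambda: state["blocker_done"], lambda: "okay")],
    proceed_condition=lambda: state["blocker_done"],
    wait_for_change=finish_blocker,
    abort_local_work=count_abort,
)
```

In this sketch the first pass finds no applicable clause, its work is aborted
once, and the second pass returns normally.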

4.4 Recovery 

When locally atomic objects are used to implement globally atomic objects, the
effects of a committed locally atomic computation have to be compensated explicitly
when the containing globally atomic computation aborts. This section describes how
these compensating activities can be programmed. Similar ideas have been
proposed in [55, 38, 1]. We will not present any comparison since the purpose of this
section is merely to show that recovery paradigms compatible with the rest of our
implementation paradigm can be designed.






We will describe two different recovery paradigms in this section. One of them uses 
the history objects as intentions lists and the other uses them as undo logs. The two
paradigms described in this section are not mutually exclusive methods; rather, they 
represent two ends of a spectrum of possibilities. For example, an application may 
use one paradigm for certain operations and the other paradigm for the other 
operations. Depending on the type of an application, one paradigm may be more 
efficient and/or convenient than the other. 

In addition to performing compensating activities, it is also necessary to condense
the information contained in the history objects which would otherwise grow
indefinitely. We can condense the information contained in the transition objects
into a more compact representation after they are committed. How the compaction
is performed is related to the recovery paradigm.

4.4.1 Intentions list Paradigm 

In the intentions list paradigm, the state of a globally atomic object is represented by
a collection of locally atomic objects (which will be called a snapshot) and a locally
atomic history object. The history object records the transitions of the operations
that have been invoked at the globally atomic object. For committed transitions, the
application can specify a locally atomic computation which merges their effects into
the snapshot and deletes them from the history object. Aborted transitions can be
deleted without further action. This kind of commit processing can be viewed as
taking the processing "off-line" after the serialization order and the outcomes of the
transitions are known. To simplify the application, the committed transitions are
merged in the global serialization order. In other words, a committed transition can
be merged only if there are no prior unmerged transitions in the history object.

When an operation is invoked on the globally atomic object, the snapshot and the
history object are examined to determine whether a conflict is created. If the
operation can proceed immediately, a valid result for the operation is also determined
from the snapshot and the history object. Before returning, the transition for this
operation is inserted into the history object. The accesses to the snapshot and the
history object are executed in a locally atomic computation.

The intentions list paradigm minimizes the work performed when an operation on the
globally atomic object is aborted. If the operation is aborted before the locally atomic
computation is committed, changes to all the locally atomic objects will be undone. If
the operation is aborted afterwards, the only compensating activity needed is to
delete the corresponding transition from the history object. The deletion can be
executed as a short locally atomic computation.
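
The bookkeeping of the intentions list paradigm can be illustrated with a small
Python sketch of a set object. It is a simplification under stated assumptions:
the class and method names (IntentionsListSet, record, commit, abort,
merge_committed) are hypothetical, the history list is assumed to be kept in the
global serialization order already, and conflict detection and local atomicity
are left out.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Transition:
    action: str                  # hypothetical action identifier
    op: str                      # "insert" or "delete"
    arg: int
    status: str = "tentative"    # "tentative" or "committed"


@dataclass
class IntentionsListSet:
    snapshot: Set[int] = field(default_factory=set)
    # Assumed to be ordered by the global serialization order.
    history: List[Transition] = field(default_factory=list)

    def record(self, action: str, op: str, arg: int) -> None:
        """Insert the transition of an operation before it returns."""
        self.history.append(Transition(action, op, arg))

    def commit(self, action: str) -> None:
        for t in self.history:
            if t.action == action:
                t.status = "committed"

    def abort(self, action: str) -> None:
        # The only compensating activity: delete the transitions.
        self.history = [t for t in self.history if t.action != action]

    def merge_committed(self) -> None:
        # Merge in serialization order and stop at the first transition
        # that is still tentative, so no prior unmerged transition is skipped.
        while self.history and self.history[0].status == "committed":
            t = self.history.pop(0)
            if t.op == "insert":
                self.snapshot.add(t.arg)
            else:
                self.snapshot.discard(t.arg)


# Usage: a3 is committed but cannot be merged while a2 is tentative.
s = IntentionsListSet()
s.record("a1", "insert", 1)
s.record("a2", "insert", 2)
s.record("a3", "insert", 3)
s.commit("a1")
s.commit("a3")
s.merge_committed()
after_first_merge = set(s.snapshot)   # only a1 has been merged
s.abort("a2")                         # nothing to undo in the snapshot
s.merge_committed()                   # now a3 can be merged
```

Aborting a2 never touches the snapshot, which is the property the paradigm is
designed to provide.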

Deleting transitions from the history object and merging them into the snapshot as
soon as they are committed may create a problem. Occasionally, a committed
transition may be needed in a history object to determine whether conflicts are
created for an incoming operation that can be serialized before it. Depending on the
concurrency control algorithm, committed transitions may or may not be needed. In a
2-phase locking algorithm, a committed transition is never needed and a transition
can be deleted when it is merged. In a timestamp algorithm, a transition is useful in
determining whether conflicts are generated when operations with older timestamps
are invoked. If committed transitions are deleted, incoming operations with older
timestamps must be refused.

A solution to this problem is to keep a sequence of pairs of snapshots and history
objects. Before deleting committed transitions from a history object and modifying
the snapshot, a copy of the history object and the snapshot can be kept. For an
incoming operation o, the appropriate pair of snapshot and history object that is the
most up to date and yet contains all the transitions that may be serialized after o in the
history object can be chosen. A complication arises when inserting a transition.
Since the transition has to be inserted into all the history object versions, those that
have already deleted transitions prior to the transition being inserted have to be
discarded.

Since storage is limited, some of the pairs are also discarded when it becomes more
and more unlikely to have incoming operations that need to access the transitions in
those pairs. In a timestamp algorithm that assigns timestamps using a real-time
clock, transitions invoked by computations that are started before the currently
executing computations in a system can be discarded. Without global knowledge, a
transition can be deleted if it is estimated to be older than the currently executing
computations. If an older computation is still executing and accesses the history object
later, it has to be restarted. Notice that a language implementation can make the
maintenance of a sequence of pairs of snapshot and history object transparent to the
programmer. It can also make the copying of history object and snapshot more
efficient by, for example, keeping one history object and having each history object
"version" be an index to this single history object.

4.4.2 Undo Log Paradigm 

In the undo log paradigm, the state of a globally atomic object is represented by a
collection of locally atomic objects (which will be called a projection) and a locally
atomic history object. In this paradigm, instead of merging the committed transitions
during commit processing, the projection is mutated before an operation on the
globally atomic object returns. The transition for the operation is also inserted into
the history object. The projection should represent the correct abstract state
according to any global serialization order in which the transitions in the history
object may be serialized, even though there may be many such orders. No extra
work is needed if all the tentative operations eventually commit. The accesses to the
projection and the history object are executed in a locally atomic computation.

If an operation on the globally atomic object is aborted before the locally atomic
computation commits, changes made to the locally atomic objects will be undone.
No extra work from the application is necessary. If the operation is aborted after the
locally atomic computation is committed, the aborted transition will be deleted from
the history object and it will be "unmerged" from the projection with an undo
operation. The undo operation should compensate for the previous mutation of the
projection and preserve the failure atomicity of the globally atomic object. If two
operations are aborted because one of their common ancestor actions is aborted,
the undo operation of the operation serialized afterwards is invoked first. An undo
operation, along with its arguments, is specified by each operation on the globally
atomic object before the latter returns, and remembered in the transition in the
history object. The undo operation and the deletion of the transition from the history
object are executed in a locally atomic computation. The high-level synchronization
performed by an implementation, by guaranteeing that only atomic histories are
generated, ensures that this locally atomic computation does not encounter any
permanent failures. For example, the undo operation of a deposit operation in an
account object deducts the amount deposited from the projection. Since an
implementation should be prepared for the possibility of the deposit being aborted,
there are always enough funds in the projection to cover the undo operation. Transient
failures that interrupt the undo operation, such as site crashes, can be masked by
retrying.
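
The deposit example can be sketched in Python as follows. This is an
illustration under stated assumptions, not the thesis's implementation: the
names UndoLogAccount, deposit, abort, and undo_log are hypothetical, the history
list is assumed to be in serialization order, and the high-level
synchronization that keeps withdrawals from depending on tentative deposits is
not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Transition:
    action: str                    # hypothetical action identifier
    undo: Callable[[], None]       # undo operation remembered at invocation


class UndoLogAccount:
    def __init__(self) -> None:
        self.projection = 0                     # current balance
        self.history: List[Transition] = []     # in serialization order
        self.undo_log: List[str] = []           # order in which undos ran

    def deposit(self, action: str, amount: int) -> None:
        # Mutate the projection before returning, and remember how to
        # compensate for the mutation.
        self.projection += amount

        def undo() -> None:
            self.projection -= amount
            self.undo_log.append(action)

        self.history.append(Transition(action, undo))

    def abort(self, actions: Set[str]) -> None:
        doomed = [t for t in self.history if t.action in actions]
        # The operation serialized afterwards is undone first.
        for t in reversed(doomed):
            t.undo()
        self.history = [t for t in self.history if t.action not in actions]


# Usage: a2 and a3 abort together, for example because a common
# ancestor action aborted.
acct = UndoLogAccount()
acct.deposit("a1", 100)
acct.deposit("a2", 50)
acct.deposit("a3", 25)
acct.abort({"a2", "a3"})
```

After the abort the projection again reflects only the surviving deposit, and
the undo operations have run in reverse serialization order.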

The projection and the history object will be used to determine a valid result that can
be returned to an operation invoked at the globally atomic object, and to determine
whether conflicts are created. As will be discussed in section 4.7.1, the undo log
paradigm may be more efficient than the intentions list paradigm in some
applications. The comparison of the two recovery paradigms will be delayed until we
have presented some example programs.

Two problems with the undo log paradigm prevent its applicability to general
applications. The first problem arises because the paradigm requires the projection
to be maintained such that it represents the correct abstract state according to any
global serialization order in which the transitions in the history object may be
serialized, even though there may be many such orders. For some applications, this
is not possible. For example, when the operations insert(i) and delete(i) are executed
on a set object, the correct projection state depends on the serialization order of the
two operations.






There are two possible solutions for this problem. One of them is to regard this
situation as the creation of a conflict. This is not ideal, as concurrency is decreased
unnecessarily. In the set example above, no conflicts are generated by the insert(i)
and delete(i) operations, since the only valid reply for both operations is okay, which
would remain valid regardless of the serialization order. Another possibility is to
allow the projection to be modified when the operation commits. This is also
undesirable because of the complexity introduced into the structure of the
projection.

The second problem arises because occasionally the projection and a history suffix
do not capture enough information on the entire history of previous invocations. For
example, suppose the projection is a locally atomic array object representing the
abstract state of a set object. Invoking an insert(e) operation causes e to be inserted
into the array. On the other hand, after inserting e into the array, we have lost the
information indicating whether e was in the array before the insert(e) operation was
invoked, unless the history object contains the transition for the last insert(e) or
delete(e) operation prior to this operation. This information is needed in case the
operation is aborted, and also to determine the result of a member(e) operation
serialized before the insert(e) operation. Selecting an undo operation based on the
state of the projection when the insert(e) operation is executed is not an adequate
solution either, since the state of the projection might be changed by other undo
operations.

As a remedy, we can delay the deletion of transitions from the history object, or add a
snapshot to the state. This raises the question of how transitions should be deleted
from the history object. A possibility is to declare all committed transitions which are
certain to be serialized before all other tentative transitions eligible for deletion.
However, this does not solve the problem described above. A more complicated
scheme in which the application makes the final decision over which transition is
deleted can be devised. However, it seems complicated and may add a significant
cost to accessing the history object.






The addition of a snapshot can be regarded as a combination of the two recovery
paradigms. Snapshots can be maintained as described in the last section. They can
also be derived more cheaply than described in the previous section by saving old
projections. After all the mutator operations that have been merged into the
projection are committed, the projection can be regarded as a snapshot.

A final possibility is to encode the necessary information in a more complicated way.
In the set example, we can associate an item in the array with a linked list of boolean
values. When a delete or insert operation is invoked/aborted, a boolean value can
be inserted/removed from the list. Boolean values at the beginning of the list can be
removed by a background process as long as there is a subsequent boolean value
inserted by a committed operation.

Despite these limitations and complications, the undo paradigm is still useful in many
applications which do not have "overwrite" operations. These overwrite operations,
such as insert(e) in a set object, have the characteristic that they destroy some
significant piece of information in the old state necessary for recovery. Without
"overwrite" operations, an operation can determine all the necessary information
from the projection and the history object.

4.5 Programming interface 

This section describes some more programming constructs in order to present the
program examples in section 4.6. However, this is not meant to be a language
proposal. There is a tradeoff involved in introducing specialized constructs into a
language. While the programs that motivate these constructs become more efficient
and easier to write, the language also becomes more complicated and specialized.
More detailed study is needed before deciding what linguistic constructs are
desirable.






4.5.1 History Objects Continued 

The following is a continuation of the description of the operations provided by a 
history object. 

delete_first = procedure(h: history) returns(transition)
    % returns the transition in h that is serialized before all other
    % transitions and is committed. The transition returned is deleted
    % from h.

match = iterator(h: history, t: template) iterates(transition)
    % iterates the transitions in h that match t.

exists = procedure(h: history, t: template,
                   p: proctype(transition) returns(bool)) returns(bool)
    % returns true if there is a transition s in h such that s matches the
    % template t and p(s) returns true. Otherwise false is returned.
    % p is an optional argument. If p is omitted, only the template t is
    % used to filter transitions in h.

% The following operations are internal and invoked only by the
% language system implementation that we will discuss.

insert = procedure(h: history, t: transition)
    % inserts t into h. This operation is invoked by the language
    % implementation when an operation on a globally atomic object returns.

get_transitions = iterator(h: history, a: action_id)
                  iterates(transitions)
    % iterates the transitions that are executed in a. This operation is
    % used by the language implementation to search for transitions whose
    % status should be updated when information about the outcome of an
    % action is received. We will not show the invocations of these
    % operations in our programs in section 4.6. The update of
    % the status of a transition can be executed in a locally atomic
    % computation.



In addition to the sub, prior, and between operations described in section 4.3.1, the
history objects also support an exists operation and a match operation which allow
for searching of particular transitions in a history object. The exists operation takes a
history object, a transition template, and a procedure as arguments. Transition
templates will be described in section 4.5.3. Both the transition template and the
procedure argument are used to filter the transitions in the history object. The
procedure in the procedure argument takes a transition as an argument and returns
a boolean. The exists operation also returns a boolean as its result. It returns true if
there is a transition in the argument history object that matches the template and true
is returned when the procedure argument is invoked with this transition. False is
returned otherwise.

In the example programs that we will present in section 4.6, we will in fact need a
closure rather than a procedure in most calls to exists. To allow closures to be
passed, the programmer can specify a multiple-argument procedure p:

    p = procedure(arg_1: T_1, arg_2: T_2, ..., arg_n: transition) returns(bool)

and the closure can be specified as p(a_1, a_2, ..., a_(n-1)), where a_i is an
object of type T_i.
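
As a rough analogue, the exists operation and the closure convention can be
sketched in Python. Everything here is a hypothetical illustration: a history is
modelled as a plain list, a template as a predicate on transitions, and the
closure p(a_1) is formed with functools.partial by fixing every argument except
the trailing transition.

```python
from functools import partial
from typing import Callable, List, NamedTuple, Optional


class Transition(NamedTuple):
    status: str      # "tentative" or "committed"
    op: str          # operation name
    arg: object      # first argument of the operation
    result: object   # result value, e.g. "okay"


# Hypothetical stand-in for a template object.
Template = Callable[[Transition], bool]
Predicate = Callable[[Transition], bool]


def exists(h: List[Transition], t: Template,
           p: Optional[Predicate] = None) -> bool:
    """True if some transition in h matches t and, when p is given, p(s)."""
    return any(t(s) and (p is None or p(s)) for s in h)


def same_arg(x: object, s: Transition) -> bool:
    # A two-argument procedure whose last argument is the transition.
    return s.arg == x


def committed_insert(s: Transition) -> bool:
    return s.status == "committed" and s.op == "insert"


h = [
    Transition("committed", "insert", 5, "okay"),
    Transition("tentative", "delete", 7, "okay"),
]

found_any = exists(h, committed_insert)                           # template only
found_five = exists(h, committed_insert, partial(same_arg, 5))    # closure p(5)
found_seven = exists(h, committed_insert, partial(same_arg, 7))   # closure p(7)
```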

4.5.2 Transition Objects 

Figure 4-2 is an informal specification of the operations supported by a transition
object. A transition object can be regarded as a type of record, with various
attributes.

get_arg1 = procedure(t: transition) returns(type_of_arg1)   [*]
    % return the first argument of the operation represented by t.
    % Similarly for get_arg2, get_arg3, ..., get_result1, ...

match = procedure(t: transition, temp: template) returns(bool)
    % returns true if t matches temp, otherwise false is returned.

set_status = procedure(t: transition)
    % set the status of t to be committed.

set_undo = procedure(t: transition, undo: proctype)
    % remembers undo as the undo operation of t.
    % This is needed only when the undo recovery paradigm is used.

Figure 4-2: Interface of a Transition Object

[*] To avoid cluttering our programs with excessive type information, we have
abandoned strict typing here. However, this is not a serious problem as
transition objects and history objects can be parameterized.

We assume that the language system supports abbreviations of the form:

    transition$get_arg1(t)     abbreviated as  t.arg1
    transition$set_undo(t, s)  abbreviated as  t.undo := s

In our example programs, a procedure either returns normally or signals an 
exception. We use a special keyword okay to represent the result of a transition 
when no results are returned. The exception name is used as the result value when 
an exception is signalled. 

We also assume that the language system supports a distinguished variable
this_transition. This variable can be regarded as the current transition being
executed. It can be implemented with a value of the current action identifier which
allows comparison with other transitions to determine the relative global serialization
order. For example: history$p_sub(h, this_transition) returns a history that
only has transitions that are potentially subsequent to the caller. We assume that a
program can execute

    this_transition.undo := p

to indicate to the language system that the undo operation of the current transition is
p.

4.5.3 Template Objects 

Template objects can be used to match against transitions and filter out irrelevant
transitions in a history object. When defining transition templates, programmers are
interested only in the status of a transition, the types of the operation and result, the
arguments, and the values of the results. We will ignore the action identifiers and
object identifiers of the transitions in the templates. For example, for the set object
defined in figure 2-2 on page 47, we assume that the language interpreter can parse
transition templates of the form:

    committed_member_x_true
        committed transitions of the form <member(x)><true>.

    tentative_insert_x_okay
        tentative transitions of the form <insert(x)><okay>.






Determining whether a transition is committed is slightly more complicated than one
would expect. Normally, one would expect an implementation of a transition object
to associate a status flag with a transition and determine the status accordingly. A
complication arises when an action currently executing belongs to the same
computation as some of the transitions in the history object. Since an operation may
expect to see the effects of other transitions in the same computation that are
definitely prior to and would not be aborted independently from itself, the set of
"committed" transitions is defined to include these transitions also. Given that
another transition t is prior to the current operation o and belongs to the same
computation as o, and the names of the actions executing t and o are a_t and a_o
respectively, determining whether t would abort independently from o can follow the
following algorithm:

1. If an ancestor of a_t (or a_t itself) and an ancestor of a_o (or a_o itself) are
   parallel sibling actions, then t can be aborted independently from o,

2. otherwise it is not possible.

The action identifiers of a_t and a_o can be used to determine the family relationship.
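
One way this test could be carried out on action identifiers is sketched below in
Python. The encoding is an assumption made only for illustration: an action
identifier is taken to be the path of steps from the top-level action, and each
step records a hypothetical par_group that is shared by subactions created in
the same concurrent block and is None for sequentially created subactions. The
thesis does not prescribe this representation.

```python
from typing import NamedTuple, Optional, Tuple


class Step(NamedTuple):
    index: int                  # position among the parent's subactions
    par_group: Optional[int]    # id of the concurrent block, None if sequential


# Hypothetical action identifier: path from the top-level action.
ActionId = Tuple[Step, ...]


def can_abort_independently(a_t: ActionId, a_o: ActionId) -> bool:
    """True if some ancestor-or-self of a_t and of a_o are parallel siblings."""
    for s_t, s_o in zip(a_t, a_o):
        if s_t != s_o:
            # First point of divergence: these two steps are siblings.
            # They are parallel siblings only if they share a concurrent block.
            return s_t.par_group is not None and s_t.par_group == s_o.par_group
    # One action is an ancestor of (or equal to) the other, so no pair of
    # sibling ancestors exists.
    return False


top = (Step(0, None),)

# Two arms of the same concurrent block under the top-level action.
parallel_case = can_abort_independently(
    top + (Step(0, 1),), top + (Step(1, 1), Step(0, None)))

# Two subactions created one after the other.
sequential_case = can_abort_independently(
    top + (Step(0, None),), top + (Step(1, None),))

# a_t is an ancestor of a_o.
ancestor_case = can_abort_independently(top, top + (Step(0, 1),))
```

Only the pair of siblings directly below the lowest common ancestor needs to be
examined, because any other pair of ancestors is either identical or has
different parents.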

To avoid long template names in our programs, we assume that abbreviations can be
defined (e.g., ins_x = insert_x_okay). For similar reasons, we assume that
templates can be constructed from other templates using boolean operations. For
example, if a program defined successful_update = withdraw_x_okay or
deposit_x_okay, then a transition matches successful_update if it matches either
withdraw_x_okay or deposit_x_okay. We also assume that templates with fixed
values in the transition arguments or results can be constructed. For example, if x is
a variable defined in a program, then insert_x_okay defines a template that
matches any transition of an insert operation invoked with the object denoted by x.
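
Template construction, boolean combination, and binding of a fixed argument
value can be mimicked in Python as follows. This is a sketch under stated
assumptions: templates are modelled as predicates, and make_template, either,
and the ANY wildcard are hypothetical helpers, not part of the proposed
language.

```python
from typing import Callable, NamedTuple


class Transition(NamedTuple):
    status: str      # "tentative" or "committed"
    op: str          # operation name
    arg: object      # first argument of the operation
    result: object   # result value, e.g. "okay"


Template = Callable[[Transition], bool]

ANY = object()   # wildcard: the field is not constrained


def make_template(status: object = ANY, op: object = ANY,
                  arg: object = ANY, result: object = ANY) -> Template:
    """Build a template that constrains only the fields that are given."""
    def matches(t: Transition) -> bool:
        pairs = ((status, t.status), (op, t.op),
                 (arg, t.arg), (result, t.result))
        return all(want is ANY or want == got for want, got in pairs)
    return matches


def either(a: Template, b: Template) -> Template:
    """Boolean 'or' of two templates."""
    return lambda t: a(t) or b(t)


withdraw_x_okay = make_template(op="withdraw", result="okay")
deposit_x_okay = make_template(op="deposit", result="okay")
successful_update = either(withdraw_x_okay, deposit_x_okay)

# A template with a fixed argument value taken from a program variable x.
x = 3
insert_x_okay = make_template(op="insert", arg=x, result="okay")

ok_deposit = successful_update(
    Transition("committed", "deposit", 10, "okay"))
failed_withdraw = successful_update(
    Transition("committed", "withdraw", 10, "insufficient_funds"))
insert_of_x = insert_x_okay(Transition("tentative", "insert", 3, "okay"))
insert_of_other = insert_x_okay(Transition("tentative", "insert", 4, "okay"))
```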

4.5.4 Resource Managers 

The program examples in this chapter are structured in modules called resource
managers, which are similar to the guardians in [30]. In fact, many of the linguistic
constructs are copied from the Argus language described in [30].






At run time, an instance of a resource manager can be instantiated on a site. (In the
rest of this chapter, the term "resource manager" will be used to refer to an instance
of a resource manager and no distinction will be made.) Each resource manager can
be regarded as a virtual site in the system, with a name known by other resource
managers. We assume there are name servers [11, 45] which map resource manager
names to network addresses. The indirection allows a resource manager to be
moved to a different physical site easily. Multiple resource managers can be
instantiated on a single physical site.

A resource manager has associated with it a collection of procedures. These
procedures share some state, which only procedures from this resource manager
can access. A subset of these procedures are exported and can be called by
procedures outside the resource manager.

There are many possible ways in which top-level actions and sub-actions can be
declared. Since choosing the best way for these declarations is not relevant to the
ideas proposed in this thesis, we simply assume that all our example programs are
executed in some globally atomic computation. To insulate the caller of a procedure
from the site crashes at the resource manager invoked, we also assume that a sub-
action surrounding the call would be created if the caller executes in a different
resource manager. We assume that the locally atomic computation boundaries are
defined by a begin entry ... retry whenever c statement or a begin local
computation ... end local computation statement.

The caller and callee of a procedure, if they are in different resource managers,
communicate using a remote procedure call (RPC) paradigm: the caller suspends
until the remote procedure returns. To facilitate the implementation of atomic
objects, we use a zero-or-once semantics: when the remote action returns, the action
invoked was executed exactly once; otherwise it is executed at most once. We
assume that the system will generate an exception to the application when a
response has not been received for a remote call after a system-defined timeout. In
Chapter 7 we will describe the measures that the application can take to handle these
communication failures. For the time being, we assume that the remote action will be
aborted.

Resource managers can be used to implement atomic objects. An object may be
implemented within a single resource manager, or with several resource managers.
Depending on the overhead of using a resource manager, an application may decide,
for example, to implement a single bank account with a resource manager, or many
accounts with a resource manager. Procedures in a resource manager can be used
to implement the operations on an object.

We assume that the objects used in our example programs are locally atomic unless
otherwise specified. Two kinds of locally atomic objects are used: history objects
and regular objects. Regular objects consist of the usual array, record, ..., int
types, which have the usual serial semantics expected for these types.

For the time being, we assume each resource manager has a distinguished history
object called history_suffix. Several atomic objects implemented on the same
resource manager will share the same history object. To shorten our programs, we
also assume the distinguished history object is the first argument in a history
operation if it is not supplied. Transitions are inserted into history_suffix
automatically when an operation on a globally atomic object returns. When an action
is committed, the status of the transitions in history_suffix is updated
automatically. When an action is aborted, the aborted transitions in history_suffix
are deleted automatically.

4.6 Program Examples 

Figures 4-3 on page 100 and 4-4 on page 103 show two application programs. Figure
4-3 shows an implementation of the set object of section 2.3.2 with the intentions list
paradigm. The implementation is parameterized by the type of the items in a set.
Three operations, insert, delete, and member, are supported. The state of the set is





% This example uses the intentions list paradigm.

set[T] = resource_manager is insert, delete, member

    % abbreviations for transition templates
    no_x  = member_x_false    % <member(x)><false>
    yes_x = member_x_true     % <member(x)><true>
    del_x = delete_x_okay     % <delete(x)><okay>
    ins_x = insert_x_okay     % <insert(x)><okay>

    permanent state is
        snapshot: array[T],
        history_suffix: history

    while true do    % background process
        begin local computation
            t: transition := history$delete_first()
            if transition$match(t, committed_del_x)
                then ...    % remove t.arg1 from snapshot
            elseif transition$match(t, committed_ins_x)
                then ...    % insert t.arg1 into snapshot
            end
        end local computation
    end

    insert = procedure(x: T)
        begin entry    % begin local computation
            if ~history$exists(history$p_sub(this_transition), no_x,
                               not_changed(del_x))
                then    % insert this transition into history_suffix
                    return
            end
            % If no <member(x)><false> transitions can be potentially
            % serialized after this transition, or if there are but
            % the effect of this transition is overwritten by another
            % committed <delete(x)><okay> transition, then this
            % operation can proceed and return.
        retry whenever    % end local computation
            ~history$exists(history$p_sub(this_transition), no_x,
                            not_changed(del_x))
    end insert

    not_changed = procedure(op: template, t: transition) returns(bool)
        return(~history$exists(history$d_between(this_transition, t),
                               committed_op))
    end not_changed

Figure 4-3: An Implementation of a Set RM with the Intention List Paradigm






    delete = procedure(x: T)
        begin entry
            if ~history$exists(history$p_sub(this_transition), yes_x,
                               not_changed(ins_x))
                then    % insert this transition into history_suffix
                    return
            end
        retry whenever ~history$exists(history$p_sub(this_transition),
                                       yes_x, not_changed(ins_x))
    end delete

    member = procedure(x: T) returns(bool)
        begin entry
            if history$exists(d_prior(this_transition), committed_del_x,
                              not_p_changed(ins_x))
                then    % insert this transition into history_suffix
                    return(false)
            end
            % If there is a committed <delete(x)><okay> transition
            % serialized before this transition, and there are no
            % intervening <insert(x)><okay> transitions, then false
            % can be returned.
            % The following three clauses are similar.

            if history$exists(d_prior(this_transition), committed_ins_x,
                              not_p_changed(del_x))
                then    % insert this transition into history_suffix
                    return(true)
            end

            if ~history$exists(history$p_prior(this_transition), ins_x)
               and ~array[T]$member(snapshot, x)
                then    % insert this transition into history_suffix
                    return(false)
            end

            if ~history$exists(history$p_prior(this_transition), del_x)
               and array[T]$member(snapshot, x)
                then    % insert this transition into history_suffix
                    return(true)
            end

        retry whenever
            history$exists(d_prior(this_transition),
                           committed_del_x, not_p_changed(ins_x)) or
            history$exists(d_prior(this_transition),
                           committed_ins_x, not_p_changed(del_x)) or
            (~history$exists(history$p_prior(this_transition), ins_x) and
             ~history$exists(history$p_prior(this_transition), del_x))
    end member

Figure 4-3, continued






not_p_changed = procedure(op: template, t: transition) returns(bool)
    return(~history$exists(history$p_between(this_transition, t),
                           op))
end not_p_changed

end set

Figure 4-3, continued

represented by a locally atomic array and a locally atomic history object. When insert
and delete operations are committed, they are merged into the snapshot in an infinite
loop. The operation history$delete_first() returns the committed transition in
history_suffix that is serialized before all other transitions. Thus the committed
operations are merged in the global serialization order.
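
The merging loop can be pictured with a small Python sketch. It is only an
illustration under simplifying assumptions: the names Transition and
merge_committed, and the integer "order" field standing in for the global
serialization order, are hypothetical and do not appear in figure 4-3.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    op: str          # "insert" or "delete"
    arg: object      # the item inserted or deleted
    order: int       # hypothetical position in the global serialization order
    committed: bool  # False while the enclosing computation is tentative


def merge_committed(snapshot: set, history_suffix: list) -> None:
    """Fold committed transitions into the snapshot, earliest first.

    Merging stops at the first tentative transition: only a committed
    transition serialized before all others may leave the suffix.
    """
    history_suffix.sort(key=lambda t: t.order)
    while history_suffix and history_suffix[0].committed:
        t = history_suffix.pop(0)
        if t.op == "insert":
            snapshot.add(t.arg)
        elif t.op == "delete":
            snapshot.discard(t.arg)


snapshot = {1}
suffix = [
    Transition("delete", 1, order=2, committed=True),
    Transition("insert", 2, order=1, committed=True),
    Transition("insert", 3, order=3, committed=False),
    Transition("insert", 4, order=4, committed=True),
]
merge_committed(snapshot, suffix)
```

In this sketch the committed insert of 4 stays in the suffix because a
tentative transition is serialized before it.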

In the implementation of the insert operation, a test is made to ensure that no conflict
is created before returning okay. No conflict is created if there are not any potentially
subsequent member_x_false (i.e., no_x) transitions. Notice that the x in the template
refers to the x in the argument of the procedure. Furthermore, even if such a
transition does exist, no conflict is created if the effects of the insert operation are
overwritten by another committed delete_x_okay transition serialized between this
insert operation and the member operation. This extra filtering is achieved with the
closure not_changed(del_x). If a conflict is created, the current local computation
is aborted. The proceed condition specified in the retry statement is used as a hint
to determine how the conflict can be resolved. The implementations of delete and
member follow a similar pattern.

Figure 4-4 shows an implementation of the bank account object of section 3.2 with
the undo log paradigm. Instead of merging the committed transitions in an infinite
loop, the projection in the implementation is modified when the operation is
executed. Each transition is paired with an undo operation. The undo operation is
invoked when the transition is aborted. Changes to the projection are undone. The
correct undo operation to invoke depends on the result of the original operation.
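
A minimal Python sketch of this pairing follows. It deliberately ignores
synchronization and the history object, and the class UndoLogAccount, its
method names, and the transition identifiers are assumptions made for
illustration only.

```python
class UndoLogAccount:
    """Projection plus one undo closure per transition (no synchronization)."""

    def __init__(self, balance=0):
        self.projection = balance
        self.undo = {}  # hypothetical transition id -> undo closure

    def _add(self, delta):
        self.projection += delta

    def deposit(self, tid, x):
        self.projection += x
        self.undo[tid] = lambda: self._add(-x)  # undo of a deposit
        return "okay"

    def withdraw(self, tid, x):
        if self.projection < x:
            # The projection was not changed, so there is nothing to undo.
            self.undo[tid] = lambda: None
            return "insufficient_funds"
        self.projection -= x
        self.undo[tid] = lambda: self._add(x)   # undo of a successful withdraw
        return "okay"

    def abort(self, tid):
        self.undo.pop(tid)()  # run the undo operation paired with tid

    def commit(self, tid):
        self.undo.pop(tid)    # the change is permanent; drop the undo


acct = UndoLogAccount(10)
acct.deposit("t1", 5)              # projection is now 15
r2 = acct.withdraw("t2", 100)      # fails, projection unchanged
r3 = acct.withdraw("t3", 4)        # succeeds, projection is now 11
acct.abort("t3")                   # back to 15
acct.abort("t2")                   # still 15
acct.commit("t1")
```

The undo recorded for withdraw depends on whether the withdrawal succeeded,
which mirrors the remark that the undo operation depends on the result.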






% This example uses the undo log paradigm

account = resource_manager is read_balance, deposit, withdraw

% transition template abbreviations
read = read_balance_x          % <read_balance()><x>
dep = deposit_x_okay           % <deposit(x)><okay>
withdr = withdraw_x_okay       % <withdraw(x)><okay>
successful_update = dep or withdr
insuf_funds = withdraw_x_insufficient_funds
                               % <withdraw(x)><insufficient_funds>

permanent state is
    projection: real    % the balance of the account
    history_suffix: history

while true do    % background process
    begin local computation
        history$delete_first()
    end local computation
end

deposit = procedure(x: real)
    begin entry
        if ~history$exists(history$p_sub(this_transition), read) and
           ~history$exists(history$p_sub(this_transition),
                           insuf_funds, high(x))
            then projection := projection + x
                % declare undo operation for deposit
                this_transition.undo := un_deposit(x)
                % insert this transition into history_suffix
                return
        end
    retry whenever
        ~history$exists(history$p_sub(this_transition), read) and
        ~history$exists(history$p_sub(this_transition), insuf_funds)
end deposit

un_deposit = procedure(x: real)
    % this procedure, together with the update of the status of
    % the deposit transition, are executed as a local computation
    projection := projection - x
end un_deposit

high = procedure(x: real, t: transition) returns(bool)
    return(highest_possible_balance_at(t) + x >= t.arg1)
end high

Figure 4-4: An Implementation of a Bank Account Object
with the Undo Log Paradigm






highest_possible_balance_at = procedure(t: transition) returns(real)
    return(projection - definite(dep, t) + possible(withdr, t))
    % Unmerge effects of deposits that are definitely subsequent
    % to t and withdrawals that are tentative or potentially
    % subsequent to t.
end highest_possible_balance_at

low = procedure(x: real, t: transition) returns(bool)
    return(lowest_possible_balance_at(t) - x < t.arg1)
end low

lowest_possible_balance_at = procedure(t: transition) returns(real)
    return(projection - possible(dep, t) + definite(withdr, t))
end lowest_possible_balance_at

definite = procedure(opname: template, t: transition) returns(real)
    value: real := 0
    for each s: transition in history$match(history$d_sub(t),
                                            opname) do
        value := value + s.arg1
    end
    return(value)
end definite

possible = procedure(opname: template, t: transition) returns(real)
    value: real := 0
    for each s: transition in history$match(history$p_sub(t),
                                            opname) do
        value := value + s.arg1
    end
    % Add in the values of potentially subsequent transitions.
    for each s: transition in history$match(d_prior(t),
                                            tentative_opname) do
        value := value + s.arg1
    end
    % Add in the values of tentative transitions, but avoid
    % repeating those above.
    return(value)
end possible

Figure 4-4, continued



read_balance = procedure() returns(real)
    begin entry
        if ~history$exists(history$p_prior(this_transition),
                           tentative_successful_update) and
           ~history$exists(history$p_sub(history$p_prior(
                           this_transition), this_transition),
                           committed_successful_update)
            then this_transition.undo := ...
                % no undo is necessary
                % insert this transition into history_suffix
                return(highest_possible_balance_at(this_transition))
        end
    retry whenever
        ~history$exists(history$p_prior(this_transition),
                        tentative_successful_update) and
        ~history$exists(history$p_sub(history$p_prior(
                        this_transition), this_transition),
                        committed_successful_update)
end read_balance

withdraw = procedure(x: real) signals(insufficient_funds)
    begin entry
        if highest_possible_balance_at(this_transition) < x
            then this_transition.undo := ...
                % insert this transition into history_suffix
                signal insufficient_funds
        end

        if ~history$exists(history$p_sub(this_transition), read) and
           ~history$exists(history$p_sub(this_transition), withdr,
                           low(x)) and
           lowest_possible_balance_at(this_transition) >= x
            then % insert this transition into history_suffix
                return
        end

    retry whenever
        ~history$exists(history$p_sub(this_transition), read) and
        ~history$exists(history$p_sub(this_transition), withdr) and
        ~history$exists(history$p_prior(this_transition),
                        tentative_successful_update) and
        ~history$exists(history$p_sub(history$p_prior(
                        this_transition), this_transition),
                        committed_successful_update)
        % retry when the balance can be determined with certainty.
end withdraw

Figure 4-4, continued






un_withdraw = procedure(x: real)
    projection := projection + x
end un_withdraw

end account

Figure 4-4, continued



4.7 Implementation Trade-Offs 

In this section we discuss several trade-offs that an implementor of a globally atomic
object may have to make. Section 4.7.1 compares the two recovery paradigms.
Section 4.7.2 explores the possibility of implementing globally atomic objects with
other globally atomic objects. We have not considered this possibility for
concurrency reasons. However, such an implementation may have sufficient
concurrency if the underlying globally atomic objects are highly concurrent. For
example, in implementing a bank object which consists of many bank accounts,
implementing the bank object with globally atomic bank account objects is a viable
alternative to the paradigm that we have been describing in this chapter. Finally,
section 4.7.3 discusses how history objects can be partitioned to reduce the cost of
accessing them.

4.7.1 Comparison of Recovery Paradigms 

Before comparing the two recovery paradigms, it should be emphasized that the
recovery paradigm is a local choice. Each resource manager can be coded with a
different recovery paradigm. In fact, the two paradigms can be combined. Both
snapshots and projections can be maintained, and each operation can derive its
result from the appropriate objects. The comparison below is based on efficiency
and programmability. We have commented in section 4.4.2 that occasionally a
simple projection and a history object cannot capture the entire state of an object. In
those cases, either a more complicated projection is needed or an intentions list
paradigm should be used.

Figure 4-5 shows an implementation of the bank account object using an intentions
list paradigm. Comparing with the implementation that uses the undo log paradigm
in figure 4-4 on page 103, we see that the undo log paradigm is more efficient in
observing a "recent" state of the object, in other words, when there are few prior
operations whose effect needs to be "unmerged" from the projection. For example,
in executing the procedure highest_possible_balance_at(t), if there are few
tentative withdr transitions in history_suffix and t is a recent transition, then only
those few transitions potentially subsequent to t and the tentative withdraw
transitions need to be "unmerged" from the projection. On the other hand, the
effects of all the withdr transitions definitely prior to t and deposit transitions
potentially prior to t have to be merged with the snapshot. Conversely, the intentions
list paradigm is more efficient in observing an "old" state.
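
The asymmetry can be made concrete with a deliberately simplified Python
sketch. It assumes a totally ordered suffix of signed updates and ignores the
distinction between tentative, definite, and potential orderings; the function
names are hypothetical. Each function returns the balance seen at t together
with the number of transitions it had to process.

```python
def balance_undo_log(projection, later_updates):
    """Undo log: unmerge the updates serialized after t from the projection."""
    return projection - sum(later_updates), len(later_updates)


def balance_intentions_list(snapshot, earlier_updates):
    """Intentions list: merge the updates serialized before t into the snapshot."""
    return snapshot + sum(earlier_updates), len(earlier_updates)


snapshot = 100                        # state before any update in the suffix
suffix = [+10, -30, +5, +20]          # signed updates in serialization order
projection = snapshot + sum(suffix)   # state after every update in the suffix

k = 3                                 # t is serialized after the first three updates
recent = balance_undo_log(projection, suffix[k:])
old = balance_intentions_list(snapshot, suffix[:k])
```

Both computations agree on the balance, but for this "recent" t the undo log
side touches one transition while the intentions list side touches three.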

The efficiency of the paradigms also depends on the frequency of aborted
operations. If an operation is aborted, the undo log paradigm has to undo the
changes made to the projection, in addition to wasting the effort expended in
changing the projection when the operation is invoked. With an intentions list
paradigm, little work is needed. However, we anticipate aborted operations to be
uncommon.

The intentions list paradigm is easier to program with because the programmer does
not have to provide undo operations. With the undo log paradigm, undoing is
needed not only during recovery, but also when the effect of operations has to be
"unmerged" from the projection, such as when determining the result of a
read_balance operation. The fact that the state has to be merged and unmerged may
complicate programming.

4.7.2 Implementing Atomic Objects with Atomic Objects

In previous sections we have discussed how to implement globally atomic objects
using locally atomic objects. The implementations are characterized by
application-dependent synchronization and recovery because a locally atomic
computation is committed before the globally atomic computation in which it
executes is committed.






% This example uses the intentions list paradigm

account = resource_manager is read_balance, deposit, withdraw

% transition abbreviations
read = read_balance_x          % <read_balance()><x>
dep = deposit_x_okay           % <deposit(x)><okay>
withdr = withdraw_x_okay       % <withdraw(x)><okay>
successful_update = dep or withdr
insuf_funds = withdraw_x_insufficient_funds
                               % <withdraw(x)><insufficient_funds>

permanent state is
    snapshot: real
    history_suffix: history

% background process
while true do
    begin local computation
        t: transition := history$delete_first(history_suffix)
        if transition$match(t, committed_dep)
            then snapshot := snapshot + t.arg1
        elseif transition$match(t, committed_withdr)
            then snapshot := snapshot - t.arg1
        end
    end local computation
end

deposit = procedure(x: real)
    begin entry
        if ~history$exists(history$p_sub(this_transition), read) and
           ~history$exists(history$p_sub(this_transition), insuf_funds,
                           high(x))
            then % insert this transition into history_suffix
                return
        end
    retry whenever
        ~history$exists(history$p_sub(this_transition), read) and
        ~history$exists(history$p_sub(this_transition), insuf_funds)
end deposit

high = procedure(x: real, t: transition) returns(bool)
    return(highest_possible_balance_at(t) + x >= t.arg1)
end high

highest_possible_balance_at = procedure(t: transition) returns(real)
    return(snapshot - definite(withdr, t) + possible(dep, t))
end highest_possible_balance_at

Figure 4-5: An Implementation of a Bank Account Object
with the Intention List Paradigm






lowest_possible_balance_at = procedure(t: transition) returns(real)
    return(snapshot - possible(withdr, t) + definite(dep, t))
end lowest_possible_balance_at

low = procedure(x: real, t: transition) returns(bool)
    return(lowest_possible_balance_at(t) - x < t.arg1)
end low

definite = procedure(opname: template, t: transition) returns(real)
    value: real := 0
    for each s: transition in history$match(history$d_prior(t),
                                            committed_opname) do
        value := value + s.arg1
    end
    return(value)
end definite

possible = procedure(opname: template, t: transition) returns(real)
    value: real := 0
    for each s: transition in history$match(history$p_prior(t),
                                            opname) do
        value := value + s.arg1
    end
    return(value)
end possible

read_balance = procedure() returns(real)
    begin entry
        if ~history$exists(history$p_prior(this_transition),
                           tentative_successful_update) and
           ~history$exists(history$p_sub(history$p_prior(
                           this_transition), this_transition),
                           committed_successful_update)
            then % insert this transition into history_suffix
                return(highest_possible_balance_at(this_transition))
        end
    retry whenever
        ~history$exists(history$p_prior(this_transition),
                        tentative_successful_update) and
        ~history$exists(history$p_sub(history$p_prior(
                        this_transition), this_transition),
                        committed_successful_update)
end read_balance

Figure 4-5, continued






withdraw = procedure(x: real) signals(insufficient_funds)
    begin entry
        if highest_possible_balance_at(this_transition) < x
            then % insert this transition into history_suffix
                signal insufficient_funds
        end

        if ~history$exists(history$p_sub(this_transition), read) and
           ~history$exists(history$p_sub(this_transition), withdr,
                           low(x)) and
           lowest_possible_balance_at(this_transition) >= x
            then % insert this transition into history_suffix
                return
        end

    retry whenever
        ~history$exists(history$p_sub(this_transition), read) and
        ~history$exists(history$p_sub(this_transition), withdr) and
        ~history$exists(history$p_prior(this_transition),
                        tentative_successful_update) and
        ~history$exists(history$p_sub(history$p_prior(
                        this_transition), this_transition),
                        committed_successful_update)
end withdraw

end account

Figure 4-5, continued

The locally atomic computations are also serialized in a different order than the
globally atomic computations. An alternative is to construct globally atomic objects
with globally atomic objects. For example, instead of using locally atomic record
objects, a bank account can be constructed with globally atomic record objects. No
application-dependent synchronization or recovery is needed. Application programs
can be written as if there were no concurrency or failures.

We argued that using globally atomic record objects to construct globally atomic
account objects is not concurrent enough when a globally atomic computation can
last a long time. The semantics of the record objects does not allow sufficient
concurrency. However, this approach of implementing a globally atomic object with
smaller globally atomic objects may be viable if the underlying globally atomic objects
are abstract objects and their semantics can be used to increase concurrency.






In this section we will illustrate two different approaches of implementing a bank
object. A bank object consists of many bank accounts. The semantics of a bank
object is described in figure 4-6. Notice the difference between a bank object and a
bank account object. At first glance, a bank object may look similar to a bank
account object because they both support withdraw, deposit, and read_balance
operations. However, the bank object is in fact capturing the state of a collection of
bank accounts; hence it also supports a transfer operation that transfers funds
between two accounts and an audit_sum operation that returns a sum of the balances
in all the accounts.

S_i: a mapping s from account numbers to real numbers

I_i: undefined for any account number yet

T_i: <deposit(an, x), r_i, a><okay, r_i, a> = deposit_an_x_okay
     <withdraw(an, x), r_i, a><okay, r_i, a> = withdraw_an_x_okay
     <withdraw(an, x), r_i, a><insufficient_funds, r_i, a> = withdraw_an_x_insuf
     <read_balance(an), r_i, a><x, r_i, a> = read_an_x
     <transfer(an1, an2, x), r_i, a><okay, r_i, a> = transfer_an1_an2_x_okay
     <transfer(an1, an2, x), r_i, a><insufficient_funds, r_i, a> =
         transfer_an1_an2_x_insuf
     <audit_sum(), r_i, a><x, r_i, a> = audit_sum_x

     where a is an action, an_i's are account numbers,
     x is a positive real number.

N_i(s, deposit_an_x_okay) = s' where s' = s except s'(an) = s(an)+x
N_i(s, withdraw_an_x_okay) = s' if s(an) >= x, where s' = s except s'(an) = s(an)-x
N_i(s, withdraw_an_x_insuf) = s if s(an) < x
N_i(s, read_an_x) = s if s(an) = x
N_i(s, transfer_an1_an2_x_okay) = s' if s(an1) >= x,
     where s' = s except s'(an1) = s(an1)-x, s'(an2) = s(an2)+x
N_i(s, transfer_an1_an2_x_insuf) = s if s(an1) < x
N_i(s, audit_sum_x) = s if SUM_i s(an_i) = x

Figure 4-6: A State Machine for a Bank Object
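
The transition function of figure 4-6 can be transcribed almost directly into
Python. The sketch below is an illustration, not part of the specification: the
tuple encoding of transitions and the name next_state are assumptions, and it
presumes that every account mentioned already has a balance in the mapping.
A result of None means that the transition is not permitted in state s.

```python
def next_state(s, t):
    """Return the state after transition t, or None if t is not permitted in s."""
    kind, *args = t
    if kind == "deposit_okay":
        an, x = args
        return {**s, an: s[an] + x}
    if kind == "withdraw_okay":
        an, x = args
        return {**s, an: s[an] - x} if s[an] >= x else None
    if kind == "withdraw_insuf":
        an, x = args
        return s if s[an] < x else None
    if kind == "read":
        an, x = args
        return s if s[an] == x else None
    if kind == "transfer_okay":
        an1, an2, x = args
        if s[an1] < x:
            return None
        new = dict(s)
        new[an1] -= x
        new[an2] += x
        return new
    if kind == "transfer_insuf":
        an1, an2, x = args
        return s if s[an1] < x else None
    if kind == "audit_sum":
        (x,) = args
        return s if sum(s.values()) == x else None
    raise ValueError(f"unknown transition {kind!r}")


s = {"a": 10, "b": 0}
moved = next_state(s, ("transfer_okay", "a", "b", 4))
```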



We will assume that an operation on the bank object lasts for only a short period of
time, even though the operation may involve more than one account. This is possible
if, for example, the bank object is implemented on a single site. However, we assume
that there are long computations in this application because some computations
might access multiple bank objects.

An obvious approach to implement the bank object is to implement it using locally
atomic record/array/history objects and the paradigm described in this chapter.
The semantics of the bank object is used to increase concurrency. A different
approach is to implement the bank object out of the globally atomic bank account
objects that we have described in this chapter. The implementation is simple
because the account objects are globally atomic and hide the concurrency and
failures in a system. The complexity is instead hidden in the implementation of the
account objects. Notice that in this approach the semantics of the account objects is
used to increase concurrency.

We will compare these two approaches of implementing a bank object. The
difference lies in one approach using the semantics of a bank object to increase
concurrency, while the other uses the semantics of the account objects. We will
argue that concurrency and complexity of the implementations can be comparable.
However, there are also several potentially significant differences.

4.7.2.1 Two Approaches to Implement a Bank Object 

In figure 4-7 we show a partial implementation of a bank object using the
implementation paradigm described in this chapter and some locally atomic record,
array and history objects. Each bank operation is executed as a local computation,
in which locally atomic record, array, and history objects are accessed. We have not
shown the locally atomic record and array objects in figure 4-7 because they are
hidden in the implementation of the locally atomic directory object.

Figure 4-8 shows an implementation of a bank object that uses globally atomic
account objects. Notice that because concurrency and failures are hidden by the
implementation of the account objects, the implementation in figure 4-8 is relatively
simple.






% This implementation uses an intentions list paradigm.

bank = resource_manager is read_balance, deposit, withdraw, transfer,
                           audit

% abbreviations for templates

% <read_balance(an)><x> or <audit_sum()><x>
read_an = read_balance_an_x or audit_sum_x

% <deposit(an, x)><okay> or <transfer(an', an, x)><okay>
deposit_an_x = deposit_an_x_okay or transfer_an'_an_x_okay

% <withdraw(an, x)><okay> or <transfer(an, an', x)><okay>
withdraw_an_x = withdraw_an_x_okay or transfer_an_an'_x_okay

% <deposit(an, x)><okay> or <withdraw(an, x)><okay>
successful_update = deposit_an_x_okay or withdraw_an_x_okay

% <withdraw(an, x)><insufficient_funds> or
% <transfer(an, an', x)><insufficient_funds>
insuf_funds = withdraw_an_x_insufficient_funds or
              transfer_an_an'_x_insufficient_funds

permanent state is
    snapshot: directory[account_number, real]
    history_suffix: history

while true do    % background process
    begin local computation
        t: transition := history$delete_first()
        if transition$match(t, committed_deposit_an_x)
            then snapshot(t.arg1) := snapshot(t.arg1) + t.arg2
        elseif transition$match(t, committed_withdraw_an_x)
            then snapshot(t.arg1) := snapshot(t.arg1) - t.arg2
        end
    end local computation
end

deposit = procedure(an: account_number, x: real)
    begin entry
        if ~history$exists(history$p_sub(this_transition), read_an) and
           ~history$exists(history$p_sub(this_transition), insuf_funds,
                           high(an, x))
            then % insert this transition into history_suffix
                return
        end

Figure 4-7: An Implementation of a Bank Object
with the Intention List Paradigm






    retry whenever
        ~history$exists(history$p_sub(this_transition), read_an) and
        ~history$exists(history$p_sub(this_transition), insuf_funds)
end deposit

high = procedure(an: account_number, x: real, t: transition)
       returns(bool)
    % t.arg2 is the amount of the withdraw or transfer matched by t
    return(highest_possible_balance_at(an, t) + x >= t.arg2)
end high

highest_possible_balance_at = procedure(an: account_number,
                                        t: transition) returns(real)
    return(snapshot(an) -
           definite(withdraw, an, t) + possible(deposit, an, t))
end highest_possible_balance_at

audit_sum = procedure() returns(real)
    begin entry
        if ~history$exists(history$p_prior(this_transition),
                           tentative_successful_update) and
           ~history$exists(history$p_sub(history$p_prior(
                           this_transition), this_transition),
                           committed_successful_update)
            then r: real := 0
                for an: account_number in
                        directory$elements(snapshot) do
                    r := r + balance_at(an, this_transition)
                end
                % insert this transition into history_suffix
                return(r)
        end
    retry whenever
        ~history$exists(history$p_prior(this_transition),
                        tentative_successful_update) and
        ~history$exists(history$p_sub(history$p_prior(this_transition),
                        this_transition), committed_successful_update)
end audit_sum

definite = procedure(opname: template, an: account_number,
                     t: transition) returns(real)
    value: real := 0
    for each t: transition in history$match(history$d_prior(t),
                                            committed_opname_an_x) do
        value := value + x
    end
    return(value)
end definite

Figure 4-7, continued






possible = procedure(opname: template, an: account_number,
                     t: transition) returns(real)
    value: real := 0
    for each t: transition in history$match(history$p_prior(t),
                                            opname_an_x) do
        value := value + x
    end
    return(value)
end possible

balance_at = procedure(an: account_number, t: transition) returns(real)
    return(snapshot(an) -
           definite(withdraw, an, t) + definite(deposit, an, t))
end balance_at

end bank

Figure 4-7, continued

Depending on how the globally atomic account objects are implemented, our bank
application may or may not have enough concurrency. An application that uses a
combination of the implementation in figure 4-8 with the implementation of globally
atomic bank accounts in figure 4-9 is probably not concurrent enough in a system
with long computations, since no semantics of the application has been utilized. On
the other hand, if the application uses the globally atomic bank account
implementations described in figures 4-4 and 4-5, which make use of the semantics
of a bank account, the resulting application allows much more concurrency.

Notice that there is some similarity between figures 4-5 and 4-7. For example, the
deposit operations in the figures are almost identical. However, part of this similarity
is due to clever encoding of the transition templates. The read_an transition
template in figure 4-7 stands for either a read_balance_x transition or an
audit_sum_x transition that involves an, whereas the read transition template in
figure 4-5 stands for a read_balance_x transition only.

Figure 4-10 depicts the two different approaches to implement a globally atomic bank
object. Notice that both Approach 1 and Approach 2 use the implementation






bank = resource_manager is deposit, withdraw, read_balance, audit,
                           transfer

permanent state is
    dir: directory[account_number, account_resource_manager]
    % this is a directory that maps account numbers to the account
    % resource manager that implements the account. To simplify
    % our example, we assume all input account numbers are valid.

deposit = procedure(a: account_number, r: real)
    dir(a).deposit(r)
    % dir(a) looks up the resource manager corresponding to a.
    % The syntax "resource_manager_name.procedure_name(arguments)"
    % is used to call a procedure in another resource manager.
end deposit

withdraw = procedure(a: account_number, r: real)
           signals(insufficient_funds)
    dir(a).withdraw(r) resignal insufficient_funds
    % The resignal statement catches any insufficient_funds
    % signal from the withdraw procedure of the bank account object
    % and resignals it to the caller of this withdraw procedure.
end withdraw

read_balance = procedure(a: account_number) returns(real)
    return(dir(a).read_balance())
end read_balance

audit_sum = procedure() returns(real)
    result: real := 0
    for an: account_number in directory$elements(dir) do
        result := result + read_balance(an)
    end
    return(result)
end audit_sum

transfer = procedure(from, to: account_number, amount: real)
           signals(insufficient_funds)
    withdraw(from, amount) resignal insufficient_funds
    deposit(to, amount)
end transfer

end bank

Figure 4-8: A Simple Implementation of a Bank Object






account = resource_manager is deposit, withdraw, read_balance
    % Procedures exported

permanent state is
    state: globally_atomic_record[balance: real, ....]

deposit = procedure(r: real)
    state.balance := state.balance + r
end deposit

withdraw = procedure(r: real) signals(insufficient_funds)
    if state.balance < r then signal insufficient_funds end
    state.balance := state.balance - r
end withdraw

read_balance = procedure() returns(real)
    return(state.balance)
end read_balance

end account

Figure 4-9: A Simple Implementation of a Bank Account Object



paradigm described in this chapter, though at different levels of abstraction.

4.7.2.2 Comparison of the Two Approaches 

In this section we compare Approach 1 and Approach 2. The two approaches are
comparable in complexity and concurrency. However, there are also some subtle
differences. The complexity of Approach 1 is in the implementation of the globally
atomic bank objects using locally atomic objects, whereas the complexity of
Approach 2 is in the implementation of the globally atomic bank account objects.
Building globally atomic bank objects from globally atomic account objects is a
simple task, because the necessary synchronization and recovery have been
implemented with the underlying globally atomic account objects.

Concurrency and Complexity 

It is not obvious whether Approach 1 or Approach 2 is more desirable. In an
implementation that follows Approach 1 (figure 4-7), the transfer and audit_sum
operations can avoid creating any conflicts with each other as long as the other is





Approach 1 (layers, top to bottom):
    Globally atomic bank object
    Locally atomic record and history objects

Approach 2 (layers, top to bottom):
    Globally atomic bank object
    Globally atomic account objects
    Locally atomic record and history objects

Figure 4-10: Two Different Approaches of Implementing a
Globally Atomic Bank Object



finished but maybe tentative. Because a transfer or audit_sum operation can be part
of a much longer computation, this period of being finished but tentative can be quite
long. The concurrency is due to the semantics of audit_sum, which only requires the
result returned to be a sum of the balances, and that of transfer, which keeps the
total balance constant although it changes individual balances. As a result, when
one of the operations is completed, the other operation can proceed even when the
first operation is not committed.

If the bank object is implemented using globally atomic account objects, the transfer
and audit_sum operations will be translated into withdraw/deposit and read_balance
operations on the bank account objects. These operations interfere with one
another and cause conflicts to be created even after the higher-level operations at
the bank object are already finished.
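
A few lines of Python illustrate why the two levels behave differently. This is
a sketch with hypothetical helper functions, not the bank implementation of the
figures: at the bank level a transfer leaves the audited sum unchanged, while
at the account level a read of a single balance does observe the transfer.

```python
def transfer(balances, src, dst, x):
    """Return the balances after moving x from src to dst."""
    if balances[src] < x:
        raise ValueError("insufficient_funds")
    after = dict(balances)
    after[src] -= x
    after[dst] += x
    return after


def audit_sum(balances):
    return sum(balances.values())


before = {"a": 50, "b": 20}
after = transfer(before, "a", "b", 30)

# Bank level: the sum is the same whether the audit is serialized
# before or after the (possibly still tentative) transfer.
same_sum = audit_sum(before) == audit_sum(after)

# Account level: an individual read_balance does depend on the order.
reads_differ = before["a"] != after["a"]
```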

One may be tempted to implement the transfer and audit_sum operations with special
versions of the lower-level operations. In fact, a possible implementation is to define
the globally atomic bank accounts with the semantics in figure 4-11.

S_i: [s1, s2] where s1 and s2 are real numbers

I_i: [0, 0]

T_i: <deposit(x), r_i, a><okay, r_i, a> = deposit_x_okay
     <withdraw(x), r_i, a><okay, r_i, a> = withdraw_x_okay
     <withdraw(x), r_i, a><insufficient_funds, r_i, a> = withdraw_x_insuf
     <tdeposit(x), r_i, a><okay, r_i, a> = tdeposit_x_okay
     <twithdraw(x), r_i, a><okay, r_i, a> = twithdraw_x_okay
     <twithdraw(x), r_i, a><insufficient_funds, r_i, a> = twithdraw_x_insuf
     <read_balance(), r_i, a><x, r_i, a> = read_x
     <aread_balance(), r_i, a><x, r_i, a> = aread_x

     where a is an action, x is a positive real number.

N_i([s1, s2], deposit_x_okay) = [s1+x, s2+x]
N_i([s1, s2], tdeposit_x_okay) = [s1+x, s2]
N_i([s1, s2], withdraw_x_okay) = [s1-x, s2-x] if s1 >= x
N_i([s1, s2], withdraw_x_insuf) = [s1, s2] if s1 < x
N_i([s1, s2], twithdraw_x_okay) = [s1-x, s2] if s1 >= x
N_i([s1, s2], twithdraw_x_insuf) = [s1, s2] if s1 < x
N_i([s1, s2], read_x) = [s1, s2] if s1 = x
N_i([s1, s2], aread_x) = [s1, s2] if s2 = x

Figure 4-11: A Specialized State Machine for a Bank Account Object



Special operations tdeposit and twithdraw are provided for the implementation of
transfer, and an aread operation is provided for audit_sum. In essence, each bank
account keeps track of two "balances." The first balance is the normal one. The
second "balance" is updated when the update is not invoked by a transfer operation.
The second balance is read to calculate the sum of the balances. As a result, no
conflicts are created between an audit_sum operation and a transfer operation.
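
The two-balance state machine can be sketched in Python as follows. The class
name TwoBalanceAccount is an assumption, and the sketch models only the state
transitions, not synchronization or recovery.

```python
class TwoBalanceAccount:
    """s1 is the normal balance; s2 ignores updates made on behalf of transfers."""

    def __init__(self):
        self.s1 = 0
        self.s2 = 0

    def deposit(self, x):
        self.s1 += x
        self.s2 += x
        return "okay"

    def withdraw(self, x):
        if self.s1 < x:
            return "insufficient_funds"
        self.s1 -= x
        self.s2 -= x
        return "okay"

    def tdeposit(self, x):        # deposit half of a transfer
        self.s1 += x
        return "okay"

    def twithdraw(self, x):       # withdraw half of a transfer
        if self.s1 < x:
            return "insufficient_funds"
        self.s1 -= x
        return "okay"

    def read_balance(self):
        return self.s1

    def aread_balance(self):      # the balance used by audit_sum
        return self.s2


a, b = TwoBalanceAccount(), TwoBalanceAccount()
a.deposit(50)
b.deposit(20)
a.twithdraw(30)                   # transfer 30 from a to b
b.tdeposit(30)
audited = a.aread_balance() + b.aread_balance()
```

The audited sum is computed from the second balances only, so it is the same
before and after the transfer.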

This technique does not work in general situations because the cost of keeping extra
state information can be prohibitive. For example, suppose a database of employee
records is partitioned among several sites. The application provides operations to
transfer employee records from one partition to another, update information in the
employee records, and to evaluate queries. The interference between the transfer
and query operations poses a problem similar to the interference between transfer
and audit_sum in the bank application. However, keeping an extra copy of an
employee record at the old partition when it is transferred does not seem to be
acceptable. Not only is extra storage required, updating the employee records
becomes more costly also. A more appropriate solution in this example would be to
allow the partitions to return a superset of the records in that partition. The
coordinator of the query can ignore redundant records collected from the partitions.
If a record is being transferred from one partition to another, both partitions can
return the record before the transfer computation is finalized. When a record is
deleted, both partitions must be informed.

Although the examples above do not show that concurrency is necessarily
decreased when globally atomic objects are implemented with other globally atomic
objects, they do illustrate that the semantics of the lower-level globally atomic objects
have to be customized. The customization increases the complexity of implementing
a globally atomic object.

Reliability and Efficiency 

A possible disadvantage of implementing the bank object with locally atomic objects
is the centralization of synchronization and recovery information. When compared to
an implementation in which the history objects are distributed among many account
objects, the history object used by a bank object contains more transitions and is
more expensive to access. In addition, the reliability of the application can be
reduced because its functioning depends on the availability of the centralized history
object of the bank object. A possible solution to overcome these disadvantages is to
partition or replicate the state (directory) of the bank object. We will describe how
history objects can be partitioned and/or replicated in the next section.

Another possible problem of implementing globally atomic objects with locally atomic
objects is the limitation in the length of locally atomic computations. Since locally
atomic objects are implemented with other locally atomic objects, such as locally
atomic arrays or records, the lengths of the locally atomic computations have to be
kept short to minimize the cost of conflicts created in accessing locally atomic
objects. Keeping locally atomic computations short is not always possible, especially
when a locally atomic object may be partitioned or replicated. To minimize the cost
of these conflicts, we can have a multiple-layered model of atomicity, instead of the
dichotomy of local atomicity and global atomicity. A layer i atomic object can be
implemented with a layer i + 1 atomic object. The semantics of the objects in each
layer can be utilized. For example, a layer i - 1 atomic bank object can be
implemented with a layer i atomic history object and a layer i atomic bank account
object, which can in turn be implemented with a layer i + 1 atomic history object and
a layer i + 1 atomic record object.

4.7.3 Partitioning and Replicating History Objects 

When computations are long, their transitions may remain tentative and be kept in a
history object for a long period of time. Performance can become a problem when
there are too many transitions in a history object. An obvious solution to this problem
is to partition history objects into smaller history objects.

In our previous program examples, we assume one history object is shared by all the
atomic objects implemented in a resource manager. This is not necessary and can
be changed by having multiple history objects declared in the resource manager,
with history operations specifying the history object being operated on explicitly.

More complicated schemes of partitioning the history object are possible. For
example, if an operation x is only interested in a subset of the different types of
transitions, a sub-history can be created containing only those transitions. The cost
of inserting a transition, which happens once, may become higher because the
transition may have to be inserted into several sub-histories. However, the cost of x
accessing a history object is lowered because there are probably fewer transitions in
the sub-history in which x is interested.

The history of a set object, for instance, can be partitioned according to properties of
the items involved. If a set object is a set of integers, the history object
can be partitioned according to the range of values of the arguments. A more
complicated example can be illustrated with the history object in the implementation
of a globally atomic employee file object. The application may decide to partition the
history object and the snapshot/projection objects according to the department that
a transition is related to. For example, if a transition involves an employee in
Department X, then only the partition of Department X needs to be accessed. When
an employee is transferred from Department X to Department Y, a transition is
inserted into each of the partitions of the two departments. If a query involves
potentially every department, all the partitions need to be accessed.
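
A minimal Python sketch of the department-based partitioning follows. It is an
illustration only: the class name, the method names, and the representation of a
transition as a string are assumptions made for the example, not part of the
programming interface of Chapter 4.

```python
from collections import defaultdict


class PartitionedHistory:
    """A logically centralized history stored as one list of transitions per department."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, transition, departments):
        # A transition is inserted once into every partition it is related to,
        # so a transfer from X to Y costs two insertions.
        for dept in departments:
            self.partitions[dept].append(transition)

    def lookup(self, departments=None):
        # Read only the named partitions; None means "potentially every department".
        keys = list(self.partitions) if departments is None else departments
        seen, result = set(), []
        for dept in keys:
            for transition in self.partitions.get(dept, []):
                if transition not in seen:  # a transfer appears in two partitions
                    seen.add(transition)
                    result.append(transition)
        return result
```

An operation that concerns only Department Y reads one partition, while a query over
every department reads all of them and discards the duplicates created by transfers.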

In these examples, the locally atomic and logically centralized history object is
implemented with locally atomic history partitions. The semantics of the partitions
reduces the number of partitions that need to be accessed. If only a few partitions
are accessed, the cost of accessing the entire history is reduced and the operation
can proceed even when some partitions are not available.

In [20] a history object is partitioned and replicated for availability reasons. The
history object is not partitioned according to properties of the transitions but rather
the availability of the replicas (partitions). Each transition has an initial quorum and a
final quorum. When the history object is read, an initial quorum of replicas is read to
guarantee that every transition relevant to the current operation is contained in at
least one of the replicas. When a transition is inserted into the history object, a final
quorum of replicas is accessed. For example, in determining whether conflicts are
created for an observe operation, other observer transitions are irrelevant.
Consequently, the replicas read may not overlap with the replicas updated when
previous observer transitions are inserted.

A simpler scheme of replicating the entire history object can be used to increase
availability, though not performance, over an un-replicated implementation.
However, because a history object is usually both read and written, a
read-one-write-all algorithm will not increase availability. A slightly more complicated
read-write quorum scheme [16] is needed.
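
The availability argument can be made concrete with a small Python sketch. It assumes
the usual read-write quorum condition, namely that with n replicas every read quorum of
size r must intersect every write quorum of size w (r + w > n); the details of the scheme
in [16] are not given here, so the function names and this formulation are assumptions.

```python
def quorums_intersect(n, r, w):
    """True if every read quorum of size r overlaps every write quorum of size w."""
    return r + w > n


def tolerated_failures(n, r, w):
    """Replica failures survivable by an operation that both reads and writes."""
    # The operation needs a read quorum and a write quorum among the live replicas.
    return n - max(r, w)
```

With three replicas, read-one-write-all (r = 1, w = 3) tolerates no failures for an
operation that both reads and writes, which is no better than a single copy, whereas
majority quorums (r = 2, w = 2) tolerate one failure.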

Another way of partitioning the history object can be illustrated by the example in the
previous section. By implementing the bank object with globally atomic bank
account objects, no history needs to be kept for the bank object; rather, the history
information is partitioned among the account objects. This partitioning is simpler
than those described above because no centralized image is necessary.
Unfortunately, as the example has illustrated, this partitioning may cause a loss of
concurrency.

Finally, there is a possibility of avoiding the cost of accessing the history object
altogether in some applications. Consider the semi-queue object specified in figure
4-12.

S: sets of items (we assume items enqueued are unique)

I: ∅

T: <enqueue(x), r, a><okay, r, a> ≡ enqueue_x_okay

   <dequeue(), r, a><x, r, a> ≡ dequeue_x

   <dequeue(), r, a><empty, r, a> ≡ dequeue_empty

   where a is an action, x is an item.

N(s, enqueue_x_okay) = s ∪ {x}
N(s, dequeue_x)      = s - {x}   if x ∈ s
N(s, dequeue_empty)  = s         if s = ∅

Figure 4-12: A State Machine for a Semi-Queue

An implementation using an intentions list recovery paradigm can be found in figure
4-13. In the implementation of the dequeue operation, we find that when there are
items in the snapshot, the history object has to be accessed to make sure the items
have not been dequeued by previous dequeue operations. To avoid this access, the
snapshot object can be partitioned into two arrays, say a1 and a2. The idea is to put
all the items which are definitely not dequeued into a1 and items which may have






% This example uses the intentions list paradigm.

semiq[item] = resource_manager is enqueue, dequeue

    % transitions in history suffix
    %   dequeue_x      = <dequeue()><x>
    %   dequeue_empty  = <dequeue()><empty>
    %   enqueue_x_okay = <enqueue(x)><okay>

    permanent state is
        snapshot: array[item],
        history_suffix: history

    while true do
        begin local computation
            t: transition := history$delete_first()
            if transition$match(t, committed_dequeue_x)
                then ... % remove x from snapshot
            elseif transition$match(t, committed_enqueue_x_okay)
                then ... % insert x into snapshot
            end
        end local computation
    end

    dequeue = procedure() returns(item) signals(empty)
        begin entry
            for x: item in array[item]$elements(snapshot) do
                if ~history$exists(dequeue_x) then return(x) end
            end

            if history$exists(history$d_prior(this_transition),
                              committed_enqueue_x_okay, not_used)
                then % insert this transition into history_suffix
                     return(x)
            end

            if ~history$exists(history$p_prior(this_transition),
                               enqueue_x_okay, not_d_used) and empty_snapshot()
                then % insert this transition into history_suffix
                     signal empty
            end

        retry whenever
            ~history$exists(history$p_prior(this_transition),
                            tentative_enqueue_x_okay)
            and ~history$exists(tentative_dequeue_x)

    end dequeue

    enqueue = procedure(x: item)
        begin entry
            if ~history$exists(history$p_sub(this_transition),
                               dequeue_empty)
                then % insert this transition into history_suffix
                     return
            end
        retry whenever
            ~history$exists(history$p_sub(this_transition), dequeue_empty)
    end enqueue

    not_used = procedure(t: transition)
        x: item := t.arg1
        return(~history$exists(history$d_sub(t), dequeue_x))
    end not_used

    not_d_used = procedure(t: transition)
        x: item := t.arg1
        return(~history$exists(history$d_sub(t), committed_dequeue_x))
    end not_d_used

    empty_snapshot = procedure() returns(bool)
        for x: item in array[item]$elements(snapshot) do
            if ~history$exists(committed_dequeue_x)
                then return(false)
            end
        end
        return(true)
    end empty_snapshot

end semiq

Figure 4-13: An Implementation of a Semi-Queue Object

been dequeued into a2. When a committed enqueue transition is merged, the item
can be inserted into a2 if there is a subsequent dequeue transition of that item, and
into a1 otherwise. When the dequeue operation is invoked, it can enumerate a1 first.
If there is an item in a1, it can be deleted from a1, inserted into a2, and returned to the
caller, without ever accessing the history object. If no items are found in a1, a2 can
be searched. Occasionally, a dequeue operation may be aborted after the item has
been moved into a2. The item may stay in a2 without affecting the correctness of the
implementation; a background process can move such items back to a1.
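
The a1/a2 optimization can be sketched in Python as follows. The sketch is a
single-threaded illustration under stated assumptions: the history object is replaced
by a plain set of items that have a dequeue transition, all names are hypothetical, and
"accessing the history" is read as searching it (the dequeue transition itself is still
recorded on the fast path).

```python
class SemiQueueSnapshot:
    """Snapshot split into a1 (definitely not dequeued) and a2 (possibly dequeued)."""

    def __init__(self):
        self.a1 = []
        self.a2 = []
        self.dequeued = set()      # stand-in for dequeue transitions in the history
        self.history_searches = 0  # counts searches of the history stand-in

    def merge_committed_enqueue(self, x):
        # Items with a subsequent dequeue transition go to a2, all others to a1.
        target = self.a2 if x in self.dequeued else self.a1
        target.append(x)

    def dequeue(self):
        if self.a1:
            # Fast path: anything in a1 is known not to be dequeued.
            x = self.a1.pop()
            self.a2.append(x)
            self.dequeued.add(x)   # record this dequeue transition
            return x
        for x in self.a2:
            # Slow path: each candidate in a2 must be checked against the history.
            self.history_searches += 1
            if x not in self.dequeued:
                self.dequeued.add(x)
                return x
        return None                # nothing available in the snapshot

    def abort_dequeue(self, x):
        # The aborted dequeue's transition disappears; x simply stays in a2.
        self.dequeued.discard(x)

    def reclaim(self):
        # Background step: move items that are no longer dequeued back to a1.
        back = [x for x in self.a2 if x not in self.dequeued]
        self.a2 = [x for x in self.a2 if x in self.dequeued]
        self.a1.extend(back)
```

An item left in a2 after an aborted dequeue is still found by the slow path, so the
background step affects only performance, not correctness.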






4.8 Conclusion 

In this chapter we have described programming paradigms that an implementation of
an atomic object can follow. These paradigms simplify the writing of
application-dependent synchronization and recovery code. With simpler code, arguing the
correctness of an implementation becomes easier. In particular, we introduce the
notion of locally atomic objects and locally atomic computations. Synchronization
and recovery are partitioned into those performed by the locally atomic objects and
those performed by the implementation of the atomic object. This partitioning helps
the programmer convince himself that the implementation is correct.

In this chapter, we have also introduced the use of history objects, which capture all
the relevant local information needed by an object to determine whether conflicts are
created. The interface provided by these history objects makes the underlying
concurrency control algorithm transparent to the programmers. This transparency
provided by the history objects, together with the transparency provided by the
conflict model, allows the programmer to design the functionality and program the
implementations of an application without having to understand the details of the
underlying concurrency control algorithm.

For recovery, we have discussed an intentions list paradigm and an undo log
paradigm. By imposing constraints on how an operation may mutate the locally
atomic objects, the recovery activities become a more structured process.

We have presented several program examples and illustrated the use of the 
paradigms we introduced. 

Finally, we have discussed several implementation strategies and their trade-offs.
First, there is the local choice of the recovery paradigm. Second, globally atomic
objects can be implemented using locally atomic objects or other globally atomic
objects. Finally, the cost of accessing history objects can be minimized by various
ways of partitioning them. These options provide opportunities to customize the
implementation to specific needs.






Chapter Five 
Concurrency Control Algorithms 



In our conflict model and programming interface, each atomic object is assumed to
possess some knowledge of the serialization order and operation outcomes. Based
on this knowledge, an object can express conflict conditions without knowing the
details of how the serialization order and operation outcomes are arrived at. In this
chapter we discuss how the objects arrive at a serialization order through a
concurrency control algorithm. The protocol that different entities in a distributed
system use to arrive at a consensus on the outcome of a computation is called a
commit protocol. Many papers [17, 37, 52] have been written on the subject and we
will discuss it only briefly at the end of this chapter.

This chapter seeks to fulfill two goals. First, we will show that the programming
interface that we presented in Chapter 4 can be implemented on top of a large class of
concurrency control algorithms. In particular, we show how the history operations,
such as p_sub and d_prior, can be implemented. We will also show how the retry
statement can be implemented so that the appropriate actions are taken when
conflicts are created.

Second, we will argue that in some situations the concurrency of a system can be
significantly affected by how the serialization order is determined. In deriving conflict
conditions, we find that whether a conflict arises depends on the functionality of the
operations of an application and the local knowledge of the serialization order and
operation outcomes. Previous chapters have focused on how the functionality of an
operation determines the likelihood of conflicts. This chapter shows that there are
special situations in which some concurrency control algorithms can reduce the
likelihood of costly conflicts significantly when compared to other algorithms. For
example, suppose long computations are rare in a system and it is unlikely for two
long computations to overlap their execution. Given these conditions, it may be
possible to develop concurrency control algorithms that distinguish between long
computations and short computations so that only short computations will be
restarted or cause delays in other computations. Given that, the overall cost of
conflicts in these algorithms can be much smaller than that incurred by existing
algorithms. One of the contributions of this thesis is the design of two novel
concurrency control algorithms that are adapted to systems with long atomic
computations.

Section 5.1 briefly describes some of the existing concurrency control algorithms
and compares the likelihood of costly conflicts in these algorithms. Section 5.2
describes two novel concurrency control algorithms and explains the situations in
which these algorithms can reduce the overall cost of conflicts significantly. Section
5.3 describes the implementation of the programming interface in Chapter 4 given
that different concurrency control algorithms can be used underneath. Section 5.4
discusses commit protocols briefly.

To separate our consideration of concurrency control algorithms and the
functionality of an application, we will use the terms "observer" and "mutator" in this
chapter to refer to two classes of operations. The functionality of the first class
observes the abstract state of an object. The second class mutates the abstract
state. For example, a read_balance operation is an observer, a deposit operation is a
mutator, and a successful withdraw operation is both because it observes that there are
sufficient funds and mutates the abstract state. To simplify our discussion, we will
assume that conflicts are created when:

1. an observer may be serialized after a tentative mutator, or

2. a mutator may be serialized before an observer previously invoked.

This is not true in all cases, such as when the observer is a withdraw operation and
the mutator is a deposit operation. No conflicts would be created if there were
sufficient funds for the withdrawal regardless of the deposit.

Also, we exclude the possibility of parallel sub-actions in our description of
concurrency control algorithms. A computation executes with only one locus of
control and sub-actions within a computation are serialized by the order they
execute. In most cases, it is straightforward to extend the algorithms to handle
parallel sub-actions. We will give brief explanations of how an algorithm can be
extended when the extension is not obvious.

5.1 Concurrency Control Algorithms 

The goal of a concurrency control algorithm is to ensure that a serialization order
among the committed computations exists. It also determines the actions that need
to be taken when a conflict arises.

Many different concurrency control algorithms have been proposed. Some of
them [48] use the order in which computations are started as a serialization order;
some [17] use the order in which computations commit as a serialization order. The
actions that are taken when conflicts arise depend very much on how a serialization
order is arrived at. In sections 5.1.1 and 5.1.2 we enumerate some of the well-known
concurrency control algorithms that have been proposed in the literature and discuss
the likelihood of costly conflicts in these algorithms. Enumerating all the algorithms
proposed in the literature would be impossible. However, the performance of the
algorithms described in sections 5.1.1 and 5.1.2 is representative of a large class of
algorithms.

5.1.1 Static Concurrency Control Algorithms 

In general, concurrency control algorithms can be classified according to the time
that the serialization order is determined. In static algorithms, the serialization order
is determined at the beginning of a computation. When a computation is started, a
unique timestamp is associated with the computation, and the value of the timestamp
determines the serialization order.^16 In the rest of this section, we will use Reed's
multi-version timestamp algorithm [48] as an example of static concurrency control
algorithms. In his algorithm, computations with larger timestamps are serialized after
computations with smaller timestamps.

Recall that conflicts are created under two types of situations: 

1. when a mutator m1 is invoked and it may be serialized before an
observer o1, or

2. when an observer o2 is invoked and it may be serialized after a tentative
mutator m2.

In [48], the mutator m1 is refused and the computation that invokes m1 is restarted
with a larger timestamp. Restarting a computation with a larger timestamp is the only
way to change the serialization order. The observer o2 is delayed until the tentative
mutator m2 is finalized.
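
The two reactions can be sketched in Python for a single object. This is a deliberately
simplified illustration of the conflict rules assumed in this chapter, not Reed's
multi-version algorithm itself: versions are not modelled, and the class and method
names are hypothetical.

```python
class StaticObject:
    """One object under a static (timestamp-ordered) algorithm, much simplified."""

    def __init__(self):
        self.observer_timestamps = []    # observers already invoked
        self.tentative_mutators = set()  # timestamps of mutators not yet finalized

    def mutate(self, ts):
        # First type of conflict: the mutator would be serialized before an
        # observer that was invoked earlier with a larger timestamp.
        if any(o > ts for o in self.observer_timestamps):
            return "restart"             # refuse; caller restarts with a larger timestamp
        self.tentative_mutators.add(ts)
        return "ok"

    def observe(self, ts):
        # Second type of conflict: the observer would be serialized after a
        # mutator that is still tentative.
        if any(m < ts for m in self.tentative_mutators):
            return "delay"               # wait until that mutator is finalized
        self.observer_timestamps.append(ts)
        return "ok"

    def finalize(self, ts):
        self.tentative_mutators.discard(ts)
```

An observer whose timestamp is smaller than that of every tentative mutator proceeds
immediately, since it is serialized before them.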

An alternative to refusing m1 is to abort some of the previously invoked operations,
such as the observer o1. However, this is not always possible as those operations
may have committed. Furthermore, a race condition may develop in deciding to
commit or abort those operations. The sites making the decisions must be
synchronized.

The concurrency problem created by the formation of conflicts can be evaluated with
the likelihood of formation and costs of the conflicts. The likelihood and cost of a
conflict can be classified according to the two types of situations in which it is
created. Besides depending on the functionality of the operations of an application,
the likelihood that the first type of conflicts are created in a static algorithm depends
on whether operations are arriving at an object in the predetermined static order.
The more operations arrive in that order, the less likely it is that the first type of
conflicts are created. However, considering that the time between when the
computation begins (the timestamp assigned) and when the object is accessed has a
larger variance in our system than in systems with only short computations, we may
have a significantly larger percentage of operations arriving in an order that differs
from the static serialization order. In particular, an operation from a remote caller
may find that many local computations with larger timestamps have been executed,
and probably committed, during the time the call travelled from the caller to the callee
site. Obviously, when a computation may remain tentative for a long period of time,
the second type of conflicts is also more likely to arise in a system with long
computations than in a system with only short computations.

^16 For parallel sub-actions, it suffices to extend the timestamps to non-overlapping time ranges, with
sub-actions subdividing the parent's time range. For details see [48].

In static algorithms, the cost of the first type of conflicts is a restart of the refused
computation. This is potentially disastrous as the refused computation may have
executed for a long period of time. In addition to lost work, restarts also cause
delays. If the top-level action of the refused computation is executed at a remote site,
the restart is likely to be expensive: it adds an extra round-trip delay at least. Note
that when a conflict of the first type is created in a static algorithm, the operation that
creates the conflict is likely to be invoked from a remote site. It is also possible that a
restarted computation may encounter another conflict and have to be restarted
again.

The cost of the second type of conflicts depends on how long the tentative operation
m2 remains tentative. An alternative to delaying the observer o2 is to restart the
computation that invokes o2 with a smaller timestamp. It is not always the most
appropriate action as the computation may encounter some other conflicts of the first
type because of the smaller timestamp. However, concurrency may be increased if:

1. the computation does not invoke mutator operations and cannot create
the first type of conflicts,

2. the computation has only been started recently and restarting it has a
small cost, and

3. the mutator m2 is invoked by a long computation and may not be
finalized until after a significant delay.

If the conditions described above can be evaluated at run time, the concurrency
control algorithm can minimize the cost of the conflict accordingly.

Although the likelihood of formation and costs of conflicts are generally higher in a
system with long computations than in a system with only short computations, there
are some situations under which we can expect the two kinds of systems to have a
similar concurrency level. In a static algorithm, short computations are less likely to
create the first type of conflicts than long computations. This is because they are
less likely to encounter operations with larger timestamps already executed. The
cost of restarting a short computation is also lower. Short computations may include
single-site computations and computations that execute within a tightly-coupled
group of sites that can communicate with short delay. Consequently, if all the
mutator computations are short and only read-only computations are long, short
computations can usually succeed without incurring costly conflicts. Moreover,
because read-only computations are never restarted unless a restart is cheaper than
a delay, the long read-only computations are only delayed by short mutator
computations.

5.1.2 Dynamic Concurrency Control Algorithms

In dynamic algorithms, the serialization order is determined during the execution of
the computations at the objects. Typically, the serialization order between two
computations in a dynamic algorithm is determined by the order in which they finish
accessing the last object. The moment immediately after the last object is accessed
is called a computation's locked point [6], which, to simplify matters, can be equated
with the moment at which the computation is finalized.

Dynamic concurrency control algorithms have the property that an operation is
always serialized after all other finalized operations. Other tentative operations can
be either prior or subsequent to this operation in the serialization order. Given these
properties and that a conflict of the first type is created (i.e., a mutator m1 is invoked
and it may be serialized before an observer o1), the observer o1 must be tentative.
Usually the mutator m1 is delayed until the observer is finalized.^17 Delaying the
mutator eliminates the possibility that it can be serialized before the observer. When
a conflict of the second type is created (i.e., an observer o2 is invoked and it may be
serialized after a tentative mutator m2), the observer o2 is delayed until the mutator
m2 is finalized.^18

Occasionally, several computations may be deadlocked, each waiting for another to
finalize. A deadlock detection algorithm [43] can be used to detect and break the
deadlock by restarting some computation in the cycle. After the victim computation
has been chosen, one of its actions that causes the delay of other actions can be
aborted and its parent action can be notified. If the parent action has not proceeded
beyond the end of the victim action (e.g., the victim action has not finished, or the
parent action has created several parallel sub-actions and is waiting for all of them to
finish), the parent action can abort the victim action and start a new instance of it.
Otherwise, the parent action becomes a victim action also. The process is repeated
until the top-level action is reached. The top-level action could not have been
committed since it is deadlocked.
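
Detecting the cycle itself can be sketched as a search of a wait-for graph. The Python
sketch below is centralized and is not the distributed algorithm of [43]; the function
name and the representation of the graph as a dictionary are assumptions made for the
example, and the choice of a victim within the cycle is left open.

```python
def find_cycle(waits_for):
    """Return a list of computations forming a wait-for cycle, or None.

    waits_for maps each computation to the computations it is waiting on.
    """
    explored = set()  # nodes whose outgoing edges have been (or are being) explored

    def visit(node, path):
        if node in path:
            # We came back to a node on the current path: that suffix is a cycle.
            return path[path.index(node):]
        if node in explored:
            return None  # already fully explored without finding a cycle
        explored.add(node)
        for successor in waits_for.get(node, ()):
            cycle = visit(successor, path + [node])
            if cycle is not None:
                return cycle
        return None

    for start in waits_for:
        cycle = visit(start, [])
        if cycle is not None:
            return cycle
    return None
```

Any computation in the returned list is a candidate victim; restarting one of them
breaks the cycle.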

The likelihood of formation of both types of conflicts in a dynamic algorithm
increases with the number of tentative operations at an object. Unfortunately, the
likelihood will be higher in a system with long computations than in one with only
short computations. This is because the time between when an object is accessed
and when the computation is finalized is, in general, longer in a system with long
computations. To make matters worse, the delay caused by a conflict adds to the
length of a computation and makes the expected number of tentative operations even
larger.

^17 An alternative is to delay the commitment of the mutator until the observer is finalized. In this
alternative, the mutator operation can proceed but cannot commit until the observer is finalized.

^18 An alternative is to delay the commitment of the observer until the mutator is committed. The
observer can proceed but may be aborted later if the mutator is aborted. Depending on the likelihood of
a computation being aborted, this alternative may or may not improve concurrency. However, the
improvement is not significant because the observer has to wait for the mutator to commit in any case.

In dynamic algorithms, the cost of a conflict is a possibly long delay. Moreover, when
the probability of being delayed is high, there is a possibility of cascaded delaying: a
tentative operation delaying other operations is in turn delayed by another tentative
operation.

In addition to cascaded delaying, there is also the cost of deadlocks. There is some
empirical evidence [18] that deadlocks are uncommon in systems with short
computations. However, it is unclear whether this is still valid when computations are
long. When a deadlock occurs, there is the cost of detection, which usually involves
passing messages around [43], and the cost of restarting a victim action.

5.2 Improving Concurrency with Concurrency Control Algorithms 

In this section we suggest some novel concurrency control algorithms. We will show
that these algorithms can reduce the likelihood that costly conflicts will arise in a
system with long atomic computations. In particular, we will describe a hierarchical
conflict algorithm that preserves the advantages of a static algorithm over a dynamic
algorithm (short computations are less likely than long computations to encounter
conflicts and less expensive to restart, and observer operations create conflicts only
when a restart is cheaper than a delay), and generates fewer conflicts for long
computations.

We will also describe a time-range concurrency control algorithm in which each
computation is associated with a time-range instead of a timestamp. The static and
dynamic algorithms can be shown to be special cases of this algorithm. The
time-range algorithm allows the user to choose a "privileged" class of computations that
can be made to be serialized after all other computations except those also in the
privileged class. The ability to do so reduces the possibility that a privileged mutator
computation is restarted or delayed.






5.2.1 Hierarchical Concurrency Control Algorithm 

Suppose each computation is given a period identifier and a serialization identifier.
The serialization identifiers can be assigned with unique timestamps (from a real-time
clock). The two identifiers are concatenated, with the period identifier more
significant, and used to determine the serialization order of the computations.^19

Period identifiers are not necessarily unique. Computations receive their period
identifiers from period counters. We assume each site has its own period counter,
which is updated with the current clock value when a distributed computation is
started at this site, or when the period identifier of an incoming distributed
computation is larger than the current period counter.^20 The period counter will lag
behind the clock most of the time, assuming that most computations are local.
Notice that, although the period identifiers are not unique and lag behind the real
time clock, the same is not true for serialization identifiers. Local computations in
this algorithm are similar to those in the static algorithm in that they are unlikely to be
restarted in their short duration and can be restarted inexpensively.
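
The identifier scheme can be sketched in Python, where comparing tuples
lexicographically plays the role of concatenating the two identifiers with the period
identifier more significant. The class and method names are hypothetical, clock values
are passed in explicitly, and taking the maximum when a distributed computation starts
(so that a counter never moves backwards) is an assumption of the sketch.

```python
class Site:
    """A site with its own period counter (illustrative names)."""

    def __init__(self):
        self.period_counter = 0

    def start_local(self, clock):
        # Local computations take the (lagging) period counter as period identifier.
        return (self.period_counter, clock)

    def start_distributed(self, clock):
        # Starting a distributed computation brings the counter up to the clock.
        self.period_counter = max(self.period_counter, clock)
        return (self.period_counter, clock)

    def receive(self, computation_id):
        # An incoming distributed computation with a larger period identifier
        # advances the local period counter.
        period, _serialization = computation_id
        if period > self.period_counter:
            self.period_counter = period


# Identifiers are (period identifier, serialization identifier) pairs; Python's
# tuple comparison orders them with the period identifier more significant.
site_a, site_s = Site(), Site()
c = site_a.start_distributed(clock=100)      # distributed computation started at a
local = site_s.start_local(clock=150)        # later local computation at s
serialized_before_c = local < c              # compared as concatenated identifiers
```

In this scenario the local computation has the larger timestamp but the smaller period
identifier, so it is serialized before c and does not force c to restart when c arrives.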

Distributed computations perform better in this algorithm than in a static algorithm.
Consider a distributed computation c started at clock time t; it will have a period
identifier and a serialization identifier, both approximately t. Consider the period
counters at the remote sites that c will visit. If they are also t at the time c is started,
then this algorithm will have the same performance as the static algorithm because it
is just as likely that conflicts will be created. If they are greater or smaller than t, then
this algorithm will perform more poorly or better respectively. Given that a period
counter at a site s is updated only when there are other distributed computations
visiting or started at s, the period counters at the remote sites that c will visit are likely
to be less than t at the time c is started. An exception is when the clocks at those
remote sites are running ahead of the one used to generate t and other distributed
computations have visited or been started at those remote sites recently. If we
assume distributed computations to be rare or clocks to be closely synchronized, the
exception is unlikely to happen.

^19 The serialization identifiers can be extended to non-overlapping time ranges to handle parallel
sub-actions.

^20 We assume that whether a computation is local or distributed can be determined, for instance, from
the syntax of the program. In any case, this information is only a hint and does not affect the
correctness of the algorithm.

Given that the period counters of the remote sites that c will visit are smaller than t, it 
will be less likely for c to be aborted due to an old timestamp when c finally arrives at 
a remote site s. This is because the local computations started at s before s's period 
counter exceeds the period identifier of c will be serialized before c, and not cause c
to be restarted. This algorithm performs better when distributed computations are 
infrequent. 

Note that incrementing the period counters is an optimization and does not affect the 
correctness of the algorithm. A period counter can be left unchanged when, say, a 
distributed computation that only involves nearby sites is started. To avoid these 
distributed computations being restarted, the period counters of the nearby sites can 
be synchronized frequently by bringing the smaller counters to the values of the
larger counters. 

The hierarchical algorithm can be useful in a system in which distributed 
computations and long computations are rare. For example, most of the 
computations in a calendar application will be local. Occasionally a distributed
computation involving a meeting is started. Also, in many distributed databases, the
majority of computations will be local if the data is partitioned according to locality of
reference. 

Consider the two kinds of conflicts that can arise in a system in which distributed
computations and long computations are rare. First, a mutator m1 may be restarted
if there is an observer o1 serialized potentially after it. However, with our assumption
that distributed computations are rare, only short mutator computations are likely to
be restarted and the cost of restarting a short mutator computation is small. Second,
an observer o2 may be delayed if there is a tentative mutator m2 serialized
potentially before it. If m2 is invoked by a short computation, the cost of waiting for
m2 to be finalized is small. If m2 is invoked by a long computation, a possible
solution is to restart the computation that invokes o2 with a smaller timestamp. If
long computations are rare, we may expect the execution of two long computations
to seldom overlap with each other. Hence, given that m2 belongs to a long
computation, we may expect o2 to be invoked by a short computation most of the
time and the cost of restart of o2 is small. However, restarting a computation with a
smaller timestamp is not always possible as the computation may invoke mutator
operations. Hence short computations that invoke both observer operations and
mutator operations may have to incur a high cost in being delayed by a long mutator
computation.^21

In a system where objects support only read/write operations, it is unreasonable to
expect that short computations would invoke either only observer operations or only
mutator operations. In a system where objects support abstract operations, this
expectation is more likely to be valid. If the system also has the characteristic that
distributed computations and long computations are rare, the hierarchical algorithm
can be used to minimize costly conflicts. The hierarchical algorithm is also
preferable to the dynamic algorithm because an incomplete long computation,
though infrequent, can cause many other subsequently started short computations to
be delayed.

5.2.2 Time-Range Concurrency Control Algorithm 

The time-range algorithm we are going to describe is similar to the dynamic
timestamp allocation protocol described by Bayer in [4] but with several important
differences. We will describe Bayer's algorithm first and then the differences.



^21 Long computations that invoke both observer operations and mutator operations are less likely to
be delayed by other long computations because we expect an overlap of execution of two long
computations to be rare.




Bayer's Algorithm 

In Bayer's algorithm each computation is associated with a time range (t1, t2) such
that if the upper time bound of a computation a is less than or equal to the lower time
bound of another computation b, then a is serialized before b. These time ranges
can be shrunk dynamically but not expanded. The range will be shrunk to a single
unique value when the corresponding computation is finalized. The upper time
bound can initially assume the value infinity while the lower bound can assume
negative infinity. It should be noted that for external consistency reasons, a
computation probably should not be started with a lower time bound much smaller
than the current time.

The static and dynamic algorithms are obvious special cases of this algorithm. The 
static algorithm starts with a time range with a single value. The dynamic algorithm 
has each computation associated with a time range in which the lower time bound is 
the current time, and the upper time bound is infinity, since the locked point of the 
computation can happen any time between the current time and the indefinite future. 

The utility of this algorithm lies in its ability to shrink the time ranges dynamically so
that conflicts can be avoided. For example, if computation a has a time range of
(t1, t2) and computation b has a range of (t3, t4), then a can be serialized after b by
raising t1 or shrinking t4 until t1 is greater than or equal to t4. Obviously this is not
possible when t2 is less than or equal to t3. In those cases shrinking is disallowed
and a has to be restarted if it is trying to invoke a mutator operation and b has
invoked an observer operation.^22
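The shrinking step can be illustrated with a small sketch. It is only an illustration under assumptions: the TimeRange class and serialize_after function are hypothetical names, both input ranges are assumed to be non-empty, and the choice of split point (the midpoint of the overlap, or one unit inside it when a bound is infinite) is an arbitrary heuristic that Bayer's algorithm does not prescribe.

```python
import math
from dataclasses import dataclass


@dataclass
class TimeRange:
    lower: float = -math.inf
    upper: float = math.inf


def serialize_after(a: TimeRange, b: TimeRange) -> bool:
    """Shrink a and b so that a is serialized after b.

    Returns False when a.upper <= b.lower, i.e. a is already serialized
    before b and shrinking cannot help (a would have to be restarted).
    """
    if a.upper <= b.lower:
        return False
    if a.lower >= b.upper:
        return True  # already in the desired order
    # Any split point strictly inside the overlap keeps both ranges non-empty.
    lo = max(a.lower, b.lower)
    hi = min(a.upper, b.upper)
    if math.isinf(lo) and math.isinf(hi):
        split = 0.0
    elif math.isinf(lo):
        split = hi - 1.0
    elif math.isinf(hi):
        split = lo + 1.0
    else:
        split = (lo + hi) / 2.0
    a.lower = split   # raise a's lower bound
    b.upper = split   # lower b's upper bound
    return True
```

Ranges are only ever narrowed here, which matches the rule that time ranges can be shrunk but not expanded.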

Our Time-Range Algorithm 

In our time-range algorithm, time ranges are extended to a more general form: 



^22 Parallel sub-actions can be serialized by sub-dividing the time range of the parent action into
non-overlapping time ranges.




(max(L1, L2, ..., Lm), min(U1, U2, ..., Un))

where Li and Ui can be either a constant real number or a computation identifier. In
the algorithm that we have described above, there is no way to ensure that a
computation a will be serialized before/after b by shrinking a's time range if b's
lower/upper time bound is negative infinity/positive infinity and cannot be
changed.^23 To overcome this limitation, we allow the computation identifier of b to
appear in a's lower/upper time bounds, which implies that b must be serialized
before/after a. Initially a's time range can start with (negative) infinity or a constant
in its upper or lower bound. The time range can be shrunk and computation
identifiers of other computations, such as b's, can be added to ensure particular
serialization order relationships.

We assume that each computation is associated with a site, called its coordinator, 
that keeps track of the final timestamp value of that computation. When b is finalized 
and the time range of b is shrunk to a single constant value, the sites that keep 
copies of a's time range can request this value from b's coordinator and replace the
computation identifier with the constant. We call this process the binding of the
computation identifiers. We will discuss how binding information can be propagated
later. 

To make sure that the time range is not empty, i.e. the lower bound is smaller than the
upper bound, a computation should not commit until all the computation identifiers in
its time range are bound. Any computation with an empty time range is aborted. This
rule guarantees that if a cycle of serialization orderings is formed with each
computation in the cycle assumed to be serialized before the next computation, at
least one of the computations in the cycle will be prevented from committing. This is
because one of the computations in the cycle must have an empty time range.

When a computation is aborted, infinity can be assigned to its computation identifier
if it is used as an upper time bound, or negative infinity if it is used as a lower time
bound. This rule implies that when a computation a is to be serialized after two other
computations b and c, a must include both b's and c's identifiers in its lower time
bounds, even when b is constrained to be serialized after c. If a includes only the
computation identifier of b in its time range and b is aborted later, the serialization
ordering between a and c is expressed in neither a's nor c's time range.

^23 Changing b's time bound may involve sending messages to other sites and require a long delay.

When a cycle is formed, two different scenarios may happen. In the first scenario,
some of the computations in the cycle will commit and at least one of the other
computations will discover that it has an empty time range. For example, if the time
ranges of the computations a, b, and c are as follows:

a: (t1, t2)
b: (max(a, t3), min(c, t4))
c: (t5, min(a, t6))

assuming that t1 < t2, t5 < t6, and the system chooses a final timestamp value for a
in (t1, t2) that is larger than t5, a and c will be committed eventually but b will be
aborted because c's final timestamp value is less than a's.
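The first scenario can be traced with a small sketch. The names below (TimeRange, try_commit, the finals table, the midpoint policy, and the concrete values standing in for t1 through t6) are assumptions made for illustration; the sketch only shows the rules stated above: a computation waits while identifiers in its range are unbound, and it aborts when its range is empty.

```python
import math
from typing import Callable, Dict, List, Union

Bound = Union[float, str]   # a real constant or a computation identifier


class TimeRange:
    def __init__(self, lowers: List[Bound], uppers: List[Bound]):
        self.lowers = list(lowers)
        self.uppers = list(uppers)

    def bind(self, finals: Dict[str, float]) -> None:
        # Replace identifiers of finalized computations with their timestamps.
        self.lowers = [finals.get(x, x) if isinstance(x, str) else x
                       for x in self.lowers]
        self.uppers = [finals.get(x, x) if isinstance(x, str) else x
                       for x in self.uppers]

    def is_bound(self) -> bool:
        return not any(isinstance(x, str) for x in self.lowers + self.uppers)


def try_commit(name: str, ranges: Dict[str, TimeRange],
               finals: Dict[str, float],
               pick: Callable[[float, float], float]) -> str:
    r = ranges[name]
    r.bind(finals)
    if not r.is_bound():
        return "wait"          # must not commit with unbound identifiers
    lo = max(r.lowers, default=-math.inf)
    hi = min(r.uppers, default=math.inf)
    if lo >= hi:
        return "abort"         # empty time range
    finals[name] = pick(lo, hi)
    return "commit"


# a: (t1, t2), b: (max(a, t3), min(c, t4)), c: (t5, min(a, t6))
ranges = {
    "a": TimeRange([0.0], [10.0]),
    "b": TimeRange(["a", 1.0], ["c", 9.0]),
    "c": TimeRange([2.0], ["a", 8.0]),
}
finals: Dict[str, float] = {}
midpoint = lambda lo, hi: (lo + hi) / 2.0   # assumes finite bounds
outcomes = [try_commit(n, ranges, finals, midpoint) for n in ("b", "a", "c", "b")]
```

With these values a commits first, c then commits with a smaller timestamp than a, and b finds its lower bound above its upper bound and aborts.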

Deadlock Resolution 

In the second scenario, a deadlock will develop, such as when:

a: (b, t1)
b: (a, t2)

A deadlock detection algorithm can be used to abort one of the computations in the
cycle. However, not all deadlocks represent a cycle in the serialization orderings.
For example, we may have:

a: (b, t1)
b: (t2, a)

where t1 > t2. In this example, a is assumed to be serialized after b and b is
assumed to be serialized before a. These assumptions are obviously compatible and
a serialization order is not ruled out by them. However, a deadlock develops
because both a and b are waiting for the other to finalize.






To avoid aborting any computation when these deadlocks occur, we can rely on the
deadlock detection algorithm to switch the "direction" of waiting. For example, if a
appears in the upper time bound of b's time range and hence b is waiting for a to
finalize, b's computation identifier can be added to a's lower time bound and then a's
computation identifier can be removed from b's time range and replaced with the
upper time bound of a. In our previous example:

a: (b, t1) -> (max(b, b), t1) ≡ (b, t1)
b: (t2, a) -> (t2, t1)

The switching preserves the correctness of our algorithm because at least one of a
and b is waiting for the other to finalize at all times. After the switch, a is waiting for b
instead. In our example, b can proceed with its commitment and a can be committed
if t2 is less than t1.

To avoid having switchings that nullify one another's effects and to ensure that the
deadlock will be resolved eventually, the switching can be limited to one direction.
For example, we can limit the algorithm to remove computation identifiers only from
upper time bounds and insert them only into lower time bounds. To avoid creating
deadlocks with the switching when there are not any, identifiers can only be removed
from the upper time bounds if there are not any other computation identifiers in the
lower time bounds. In other words, the removal should allow the computation to
commit. In the previous example involving computations a, b, and c, we will never
have:

a: (t1, t2) -> (max(c, t1), t2)
b: (max(a, t3), min(c, t4)) -> (max(a, t3), t4)
c: (t5, min(a, t6)) -> (max(b, t5), t6)
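The one-directional switching rule can be sketched as follows. The representation of time ranges as pairs of lists and the name switch_direction are hypothetical. The sketch removes an identifier only from an upper time bound, inserts only into a lower time bound, and refuses to switch when the waiting computation still has identifiers in its lower time bounds.

```python
from typing import Dict, List, Tuple, Union

Bound = Union[float, str]                            # constant or identifier
Ranges = Dict[str, Tuple[List[Bound], List[Bound]]]  # name -> (lowers, uppers)


def switch_direction(ranges: Ranges, waiter: str, target: str) -> bool:
    """Make `target` wait for `waiter` instead of the other way around."""
    w_low, w_up = ranges[waiter]
    t_low, t_up = ranges[target]
    if target not in w_up:
        return False    # waiter does not wait on target via an upper bound
    if any(isinstance(x, str) for x in w_low):
        return False    # the removal would not allow waiter to commit
    # Replace target's identifier in waiter's upper bounds with target's
    # own upper bound(s).
    w_up[:] = [x for x in w_up if x != target] + list(t_up)
    # Record the constraint in target's lower bounds; max(b, b) equals b,
    # so a duplicate entry is skipped.
    if waiter not in t_low:
        t_low.append(waiter)
    return True


# a: (b, t1), b: (t2, a) with t1 = 10 and t2 = 2
two = {"a": (["b"], [10.0]), "b": ([2.0], ["a"])}
switched = switch_direction(two, "b", "a")

# a: (t1, t2), b: (max(a, t3), min(c, t4)), c: (t5, min(a, t6))
three = {
    "a": ([0.0], [10.0]),
    "b": (["a", 1.0], ["c", 9.0]),
    "c": ([2.0], ["a", 8.0]),
}
refused = not switch_direction(three, "b", "c")
```

In the two-computation case b ends up with the range (t2, t1) and can commit. In the three-computation case the switch on b is refused because a's identifier is still in b's lower time bounds.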

Binding Computation Identifiers

To make sure that every computation identifier used by a time range will be bound
eventually, we have to make sure that the final timestamp value of a committed
computation c (we will discuss aborted computations later) will be sent to each site
that it had visited. In addition, since other computations that had visited those sites
might have included c's computation identifier in their time ranges and caused it to
appear in other sites, the final timestamp value of c has to be propagated to those
other sites as well. Notice that because of the indirect propagation of c's identifier,
the sites visited by c may not overlap with the set of sites visited by another
computation d that has c in its time bounds.

In order to make sure the computation identifiers can be bound eventually, we 
assume the coordinator of each computation c remembers the following in stable 
memory when c commits: 

1. c's final timestamp value,

2. a list of all the sites that c had visited,

3. the computation identifiers that c had used in its time bounds.

After commitment, the coordinator will send c's timestamp value to all the sites that c
had visited, which can be piggybacked on the messages that the coordinator uses to
convey the outcome of c (see section 5.4). The coordinator will also try to find out
the final timestamp values of the computation identifiers that c had used and send
those values to the sites that c had visited. This is necessary as other computations
may have learned those computation identifiers from c. Only when all these messages
are acknowledged can the coordinator discard the information that it had stored
during commitment. At each site being visited by c, each copy of the computation
identifiers, if there is more than one, will be replaced with the final timestamp value
before acknowledgment.
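The coordinator's bookkeeping can be sketched with an in-memory model. Everything named here (Site, Coordinator, propagate, the known_finals lookup table) is a hypothetical stand-in: real sites would exchange messages and keep the commit record in stable memory, whereas this sketch uses direct method calls and an ordinary dictionary.

```python
from typing import Dict, List, Optional, Tuple, Union

Bound = Union[float, str]


class Site:
    """Hypothetical site holding copies of time ranges: name -> (lowers, uppers)."""

    def __init__(self) -> None:
        self.ranges: Dict[str, Tuple[List[Bound], List[Bound]]] = {}

    def bind(self, ident: str, value: float) -> bool:
        # Replace every copy of the identifier, then acknowledge.
        for lowers, uppers in self.ranges.values():
            lowers[:] = [value if x == ident else x for x in lowers]
            uppers[:] = [value if x == ident else x for x in uppers]
        return True


class Coordinator:
    def __init__(self, name: str, final: float,
                 sites: List[Site], used_ids: List[str]) -> None:
        self.name = name
        # Stands in for the record kept in stable memory at commit.
        self.record: Optional[dict] = {
            "final": final, "sites": list(sites), "unsent": list(used_ids)}
        self.own_value_acked = False

    def propagate(self, known_finals: Dict[str, float]) -> bool:
        """Send bindings; return True once the commit record can be discarded."""
        if self.record is None:
            return True
        sites = self.record["sites"]
        if not self.own_value_acked:
            self.own_value_acked = all(
                [s.bind(self.name, self.record["final"]) for s in sites])
        for ident in list(self.record["unsent"]):
            if ident in known_finals and all(
                    [s.bind(ident, known_finals[ident]) for s in sites]):
                self.record["unsent"].remove(ident)
        if self.own_value_acked and not self.record["unsent"]:
            self.record = None   # all messages acknowledged
        return self.record is None
```

The record is kept until both c's own value and the values of every identifier c had used have been acknowledged by all visited sites.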

To see that every computation a_n that has the computation identifier of a
computation a_1 in its time range will eventually learn of the final timestamp value of
a_1, consider the path of computations a_1, a_2, ..., a_n along which a_n learns about the
computation identifier of a_1. (The computation a_2 accesses an object accessed by
a_1 and includes a_1's identifier in its time range. Then a_3 accesses an object
accessed by a_2 and includes a_1's identifier in its time range. Eventually a_n accesses
an object accessed by a_(n-1) and includes a_1's identifier in its time range.) Note that
each pair of adjacent computations on this path visited some site in common. Since
the coordinator of a_1 makes sure that each site that it visited learns about its final
timestamp value, the coordinator of a_2 can find out a_1's final timestamp value from
the shared site, where a_2 first learned about a_1's computation identifier. Similarly,
after a_2 sends that value to every site that it had visited, a_3 can learn about the value
from the shared site between a_2 and a_3. The process is repeated until a_n learns
about a_1's final timestamp value.

A complication arises when some of the a_i's are aborted. In the algorithm that we
described above, a_n will be waiting indefinitely for the final timestamp value of a_1. A
solution is for a_(i+1) to remember a list of all the sites and the names of the actions
(e.g., the name of a_i) from which it has learned about a_1, along with the name of a_1, in
stable memory when it commits. Instead of waiting for a_i to propagate the final
timestamp value of a_1, a_(i+1) can send queries to each of the sites in the list. If
records about those actions from which a_(i+1) learns about a_1 cannot be found in any
of the sites in the list and none of those sites is in the process of sending out a_1's
final timestamp value, a_(i+1) can propagate the value of positive infinity to a_(i+2) if a_1's
identifier is used as an upper time bound by a_(i+1), or negative infinity if it is used as a
lower time bound. This is because the serialization constraint is established between
a_i and a_(i+1), instead of between a_1 and a_(i+1). This solution is correct only if a_(i+1)
limits the propagation of the infinity value to a_(i+2) and not to any other computation
that happens to use a_1's identifier. So when a site receives an infinity value from
a_(i+1), it should bind an a_1 identifier in its memory only when the identifier has been
learned from a_(i+1).

Privileged Computations

In the rest of this section we will describe an optimization that will allow the
computations in the system to have different priorities. In particular, we can use the
optimization to make mutator computations less likely to be restarted. Suppose there
is a class of computations with the following form of time ranges:

(max(L1, L2, ..., Lm), ∞)

and the property that the identifiers of these computations are not allowed to be used
in the lower time bounds of any computation. Consequently, the final timestamp
values of these computations are never required to be smaller than any other value,
and since they have no upper bound, we can always find real constants that exceed
their lower time bounds. In other words, these computations can commit even when
there are unbound computation identifiers in their lower time bounds. Choosing final
timestamp values for these computations has to be delayed until the unbound
computation identifiers are bound, however.

These computations are "privileged" because they can always avoid being restarted
by including the upper time bounds or the identifiers of other computations (except
those in the privileged class) in their lower time bounds. It should be noted, however,
that a privileged computation may still be delayed due to tentative mutators that are
serialized potentially before itself.

Because privileged computations can commit without binding their time ranges, a 
deadlock involving committed computations can be developed. Because of the 
restriction that identifiers of privileged computations cannot be used in the lower time
bounds of other computations, a deadlock must involve non-privileged computations, 
which must be uncommitted and can be chosen as victims to be aborted. 

The time-range algorithm is useful in a system where the only long computations are
mutator computations. By assigning the long mutator computations as privileged
computations, the mutator computations can avoid being restarted by other observer
computations. Mutator computations are also not delayed by tentative computations
because they do not observe any state. Short observer computations in the system
can avoid being delayed by tentative long mutator computations by restarting with
smaller timestamps. The cost of restart is low. However, this may not be possible if a
short computation invokes both observer and mutator operations. Compared to
other multi-version algorithms [9, 48], our algorithm has the advantage that the long
mutator computations are never restarted by the concurrency control algorithm. The
cost of our time-range algorithm lies in the complexity of manipulating time ranges
and sending messages around to bind the computation identifiers.

Examples of applications that have only long mutator computations and short
observer computations are databases that are replicated on many sites for availability
and efficiency of observer operations. Mutating the state of one of these databases
is a long computation because of the large number of replicas. Frequently, a mutator
computation also does not observe the state of the database, such as when old data
values are overwritten with new data values. On the other hand, usually only short
queries are directed at the database because most data is available from the local site.

5.3 Making Concurrency Control Algorithms Transparent 

In the previous two sections we have described various concurrency control 
algorithms. We have shown that under special situations concurrency control 
algorithms can be adapted to minimize costly conflicts. For example, in the case of 
the hierarchical algorithm, long computations would not suffer from repeated restarts 
when they are rare.

Given that different concurrency control algorithms might be appropriate in different
applications, we have designed a programming interface which hides the
concurrency control algorithm used underneath. The history operations, such as
p_sub or d_prior, make the algorithm by which the serialization order is determined
transparent. The retry statement also makes the actions that need to be taken when
a conflict arises transparent.

This section describes how to implement such a programming interface given a 
particular concurrency control algorithm. In section 5.3.1 we will describe the 
implementation of the history operations that capture the serialization order. In
section 5.3.2 we will describe the implementation of the retry statement.






5.3.1 Implementation of History Operations 

In this section we will describe how the history operations p_sub, p_prior, d_sub, and
d_prior can be implemented given that a static or a dynamic concurrency control
algorithm is used. Our goal is to show that these operations can be implemented and
our descriptions will not focus on efficiency. The operation p_between can be
implemented by filtering a history object with p_prior and p_sub. d_between can be
implemented with d_prior and d_sub similarly.

Figure 5-1 defines the subset of transitions that should be returned by the sub and
prior operations for a static and a dynamic concurrency control algorithm. In the
dynamic algorithm, we assume that each transition is labelled with two timestamps
from a Lamport clock [27]: an operation timestamp and a commit timestamp. The
operation timestamp is read immediately before the corresponding operation returns,
and the commit timestamp is read when the computation commits. For the operation
being executed currently, the current clock value can be used as its operation
timestamp. We assume that these timestamps are remembered in a history object. A
commit timestamp can be piggybacked on a message that informs a site of a
computation's outcome and recorded in a history object when the computation's
status in the history object is updated.

Implementations for other concurrency control algorithms are similar to those in
figure 5-1. For example, the implementations for the hierarchical algorithm and the
static algorithm are the same except that the two timestamps for a computation are
concatenated for comparison in the former and a single timestamp is used in the
latter. In an implementation for the time-range algorithm, a transition can be
serialized potentially before or after another transition if their time ranges can
possibly overlap. Otherwise, one of them is serialized definitely before the other.






Static serialization concurrency control algorithm:
(Implementations for the d_ counterparts are identical.)

p_sub = procedure(h: history, t: transition) returns(history)
    return transitions in h with larger timestamps than t
    end p_sub

p_prior = procedure(h: history, t: transition) returns(history)
    return transitions in h with smaller timestamps than t
    end p_prior


Dynamic serialization concurrency control algorithm:

p_sub = procedure(h: history, t: transition) returns(history)
    if t has a commit timestamp c
        then return all finalized transitions that have larger
             commit timestamps than c and all tentative transitions
             in h
        else return all tentative transitions in h and all finalized
             transitions that have larger commit timestamps than
             the operation timestamp of t
        end
    end p_sub

d_sub = procedure(h: history, t: transition) returns(history)
    if t has a commit timestamp c
        then return all finalized transitions that have larger
             commit timestamps than c and all tentative transitions
             that have larger operation timestamps than c in h
        else return an empty set
        end
    end d_sub

p_prior = procedure(h: history, t: transition) returns(history)
    return (all transitions in h - d_sub(h, t))
    end p_prior

d_prior = procedure(h: history, t: transition) returns(history)
    return (all transitions in h - p_sub(h, t))
    end d_prior

Figure 5-1: Implementations for Sub and Prior
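The dynamic half of figure 5-1 can also be written as a short executable sketch. The Transition record, its field names, and the use of plain lists for history objects are assumptions made for illustration; a tentative transition is modelled as one whose commit timestamp is absent, and t itself is assumed not to be a member of h.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transition:
    name: str
    op_ts: int                        # operation timestamp
    commit_ts: Optional[int] = None   # None while the transition is tentative


def p_sub(h: List[Transition], t: Transition) -> List[Transition]:
    # Potentially subsequent: every tentative transition, plus finalized
    # ones that committed after t's commit (or operation) timestamp.
    ref = t.commit_ts if t.commit_ts is not None else t.op_ts
    return [x for x in h if x.commit_ts is None or x.commit_ts > ref]


def d_sub(h: List[Transition], t: Transition) -> List[Transition]:
    # Definitely subsequent: only defined once t has committed.
    if t.commit_ts is None:
        return []
    c = t.commit_ts
    return [x for x in h
            if (x.commit_ts is not None and x.commit_ts > c)
            or (x.commit_ts is None and x.op_ts > c)]


def p_prior(h: List[Transition], t: Transition) -> List[Transition]:
    sub = d_sub(h, t)
    return [x for x in h if x not in sub]


def d_prior(h: List[Transition], t: Transition) -> List[Transition]:
    sub = p_sub(h, t)
    return [x for x in h if x not in sub]
```

As in the figure, the prior operations are defined as complements: what is potentially prior is whatever is not definitely subsequent, and what is definitely prior is whatever is not potentially subsequent.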






5.3.2 Implementation of Retry Statement 

This section describes how the retry statement can be implemented given a 
concurrency control algorithm. In particular, we will use a static algorithm as an 
example. We will also describe implementations for other concurrency control 
algorithms, although more briefly. Our description of implementations will focus on 
their feasibility, but brief references to efficiency will be made occasionally. We will 
first present an example of the kind of decision making that is involved in the 
execution of a retry statement. Then we will describe an implementation. 

When a retry statement is executed in a system with a static algorithm, the language
system should decide whether the computation executing the statement should be
delayed or restarted, and if it is delayed, when it should be rescheduled. With a
dynamic algorithm, the only possibility is to delay a computation. The only decision is
when to reschedule a computation. With a time-range algorithm, the decisions are
more complicated. A computation can be delayed, restarted, or have its time range
shrunk in different ways.

When the system is faced with these decisions, there are no optimal decisions
without knowledge of the future. Heuristics are needed to determine the relative
likelihood of correctness and cost of each of the choices. For example, it is
reasonable to expect that a tentative transition is more likely to commit than to abort
and make decisions accordingly. We will also assume that it is unlikely to have an
operation invoked in the future serialized before some existing operations.

An Example

Consider the proceed condition

~history$exists(history$p_sub(this_transition), no_x,
                not_changed(del_x))

in the Insert procedure in figure 4-3 where

not_changed = procedure(op: template, t: transition) returns(bool)
    return(history$exists(history$d_between(this_transition, t),
                          committed_del_x))
    end not_changed

Suppose the proceed condition evaluates to false and a transition t is the only no_x
transition that causes the condition to be false. The proceed condition will be
satisfied when either:

1. t is aborted, or

2. t is serialized definitely before the invoked operation, or

3. a committed_del_x transition is serialized definitely between the invoked
operation and t.

Item 1 is unlikely to happen, regardless of the concurrency control algorithm used.

Suppose a static concurrency control algorithm is used. Item 2 will only happen with
a restart because the predetermined serialization order does not change. Item 3 is
only likely to happen if there is already a tentative_del_x transition serialized
between the invoked operation and t. In those cases, the invoked operation can be
delayed until the del_x transition is finalized. In other cases, the invocation request
should be refused and the computation that invokes it restarted. Although it is
possible that the restart is unnecessary after all, it is the most appropriate choice
under our assumptions.

If a dynamic concurrency control algorithm is used, delaying the current operation
cannot cause item 3 to become true. In fact, it would achieve the opposite effect.
Also, a del_x transition may not exist after all. Item 2 can be fulfilled by delaying the
current operation until t is finalized, the most appropriate step to take in this case.

If a time-range concurrency control algorithm is used, the system may have several
choices. The time range of the current computation may be shrunk, if necessary and
possible, in such a way that item 2 is satisfied. If a committed_del_x transition exists
and it is definitely serialized before t, the time range of this action may be shrunk
such that item 3 is satisfied. If a tentative_del_x transition exists and it is serialized
potentially before t, the current operation can be delayed until the serialization order
is known and the transition committed.

An Implementation for a Static Algorithm 

Since we have limited our proceed conditions to be constructed with boolean
operations and history operations, program analysis can be used to decide the action
to take when a retry statement is executed. The goal of the program analysis is to
determine whether a proceed condition is likely to be satisfied eventually without a
restart. Only then should an operation be delayed. We assume that when a system
is confronted with a choice of delaying or refusing an operation, delaying is
preferred. A more sophisticated decision can be based on the expected costs of the
delay and the restart.

Choosing Between Delay and Restart 

To shorten our presentation, assume that a boolean operation can be either and, or,
or negate. However, we would eliminate all the negate operations that are not
immediately applied to the result of an exists operation by making suitable program
transformations. For example, if a proceed condition is of the form:

~(exists(h, t, p) and exists(h', t', p'))

we change it to:

~exists(h, t, p) or ~exists(h', t', p')
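The transformation is an application of De Morgan's laws that pushes each negate down until it sits directly on an exists. A minimal sketch follows; the class names and the label field, which stands in for the (h, t, p) arguments, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exists:
    label: str              # stands in for the (h, t, p) arguments
    negated: bool = False   # True represents ~exists(...)


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class Not:
    arg: object


def push_negations(c, negate: bool = False):
    """Return an equivalent condition with negations only on exists leaves."""
    if isinstance(c, Exists):
        return Exists(c.label, c.negated != negate)
    if isinstance(c, Not):
        return push_negations(c.arg, not negate)
    if isinstance(c, And):
        node = Or if negate else And       # ~(x and y) == ~x or ~y
        return node(push_negations(c.left, negate),
                    push_negations(c.right, negate))
    if isinstance(c, Or):
        node = And if negate else Or       # ~(x or y) == ~x and ~y
        return node(push_negations(c.left, negate),
                    push_negations(c.right, negate))
    raise TypeError(f"unexpected condition: {c!r}")
```

After this pass a condition contains only and, or, exists, and negated exists nodes, which are the forms the likelihood analysis considers.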
If a proceed condition c is of the form:

1. c1 and c2: then c is likely only if both c1 and c2 are likely.

2. c1 or c2: then c is likely only if at least one of c1 and c2 is likely.

Other than the two forms above, c can also be of the form exists(h, t, p) or
~exists(h, t, p) where h is a history object, t is a transition template, and p is a
procedure. In order to allow the language system to determine the likelihood of an
exists or ~exists expression, we limit p to be of the form:

p = procedure(arg: transition) returns(bool)
    return(e)
    end p






where e is subjected to the same restrictions as proceed conditions. If c is of the
form:

1. exists(h, t, p): then c is likely only if t is of the form
committed_op_..., and there is a transition tr in h that matches op_...
and p(tr) is likely.

2. ~exists(h, t, p): then c is likely only if for all the transitions tr in h,
either a committed version of tr does not match t or ~p(tr) is likely.

Determining whether an exists expression is likely involves searching the history
object h at run time. The same process can be used to determine whether p(tr) or
~p(tr) is likely after replacing references to arg in e with tr at run time.
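The likelihood rules can be sketched as a recursive function over proceed conditions. The encoding is an assumption made for illustration: a transition is a (status, operation) pair, a template of the form committed_op_... is the pair ("committed", op), p is a Python function returning a condition, and boolean constants are allowed as leaves so that the example stays small.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Transition = Tuple[str, str]    # (status, operation), e.g. ("tentative", "del_x")


@dataclass
class And:
    left: object
    right: object


@dataclass
class Or:
    left: object
    right: object


@dataclass
class Exists:
    history: List[Transition]
    template: Transition                    # e.g. ("committed", "del_x")
    pred: Callable[[Transition], object]    # p: returns a condition for tr
    negated: bool = False                   # True for ~exists(h, t, p)


def negate(c):
    if isinstance(c, bool):
        return not c
    if isinstance(c, And):
        return Or(negate(c.left), negate(c.right))
    if isinstance(c, Or):
        return And(negate(c.left), negate(c.right))
    return Exists(c.history, c.template, c.pred, not c.negated)


def likely(c) -> bool:
    if isinstance(c, bool):
        return c                            # assumption: constant leaves
    if isinstance(c, And):
        return likely(c.left) and likely(c.right)
    if isinstance(c, Or):
        return likely(c.left) or likely(c.right)
    status, op = c.template
    if not c.negated:
        # Likely only if t is a committed_... template and some transition
        # with the same operation has a likely p(tr).
        return status == "committed" and any(
            tr[1] == op and likely(c.pred(tr)) for tr in c.history)
    # ~exists: every tr either cannot match t once committed, or ~p(tr) is likely.
    return all(
        ("committed", tr[1]) != c.template or likely(negate(c.pred(tr)))
        for tr in c.history)
```

A tentative del_x transition makes exists(h, committed_del_x, p) likely here, which reflects the heuristic that a tentative transition is more likely to commit than to abort.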

Determining Reschedules

Given that a proceed condition is likely to be satisfied without a restart, the language
system should determine when the current invoke request should be rescheduled. In
other words, the language system should determine when the proceed condition
becomes likely.

There are many options for determining what kinds of events and processing are
allowed to trigger the rescheduling of a suspended operation. For example, a simple
scheme is to allow only the finalization of a fixed set of transitions determined at the
execution of the retry statement to trigger rescheduling. A more complicated
alternative is to also allow the finalization of subsequently invoked transitions and
evaluation of arbitrary expressions to determine when rescheduling is appropriate.
Since the goal of this section is to show the feasibility of an implementation that can
resolve a conflict in a reasonable, but not necessarily optimal, fashion, we will use the
simpler scheme. Another reason to use the simpler scheme is to minimize the cost of
scheduling. One of the necessary consequences of using the simpler scheme is that
we cannot guarantee that a proceed condition will be met when an operation is
rescheduled, as some other operations may have executed between the suspension
and the rescheduling. However, this is considered acceptable by our programming
interface.






Suppose s is a set of transitions such that the finalization of a transition in s should
trigger the rescheduling of an operation with a proceed condition c. A program
analysis similar to the one above can be used to determine a non-empty s. If c is of
the form:

1. c1 and c2: then s is the union of the sets that trigger c1 and c2.

2. c1 or c2: same as above.

3. exists(h, t, p): if tr is a tentative transition in h such that if it is
committed, it would match t and p(tr) would return true, then tr is in
s (see footnote 24).

4. ~exists(h, t, p): if tr is a tentative transition in h that matches t and
p(tr) returns true, then tr is in s.
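The four rules above amount to a recursive walk over the structure of the proceed condition. The following Python sketch illustrates that walk under stated assumptions: the classes `Transition`, `Exists`, `And`, and `Or` are hypothetical stand-ins for the language constructs, a template is reduced to an operation name, and p is an ordinary boolean function.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Transition:
    name: str        # identifier used to report membership in s
    op: str          # operation name compared against the template
    committed: bool  # False means tentative


@dataclass(frozen=True)
class Exists:
    history: Tuple[Transition, ...]
    op: str                          # simplified template: an operation name
    p: Callable[[Transition], bool]
    negated: bool = False            # True models ~exists(h, t, p)


@dataclass(frozen=True)
class And:
    left: "Cond"
    right: "Cond"


@dataclass(frozen=True)
class Or:
    left: "Cond"
    right: "Cond"


Cond = Union[Exists, And, Or]


def trigger_set(c: Cond) -> FrozenSet[str]:
    # Rules 1 and 2: a conjunction or disjunction is triggered by the union
    # of the sets that trigger its two parts.
    if isinstance(c, (And, Or)):
        return trigger_set(c.left) | trigger_set(c.right)
    # Rules 3 and 4: tentative transitions in h that match the template and
    # satisfy p. In this simplified model both rules reduce to the same test.
    return frozenset(
        tr.name for tr in c.history
        if not tr.committed and tr.op == c.op and c.p(tr)
    )
```

Committed transitions never enter the set in this sketch, because only the finalization of a tentative transition can change whether the condition is satisfied.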

All Delayed Operations are Rescheduled Eventually

To show that this implementation is correct, in the sense that if an operation is
delayed, it will be rescheduled eventually, we need to show that s is not empty. We
will now describe an informal argument showing that it is indeed the case.

Recall that a well-formed proceed condition satisfies the following requirements:

1. The proceed condition should be satisfied if:

a. new operations are not started, and

b. all current operations in the system, except the one being
considered, are finalized and the outcomes are known by all
history objects, and

c. the operation being considered is serialized after all existing
transitions and the serialization order among existing transitions
is known.

2. It is not satisfied currently.

3. It is constructed with boolean operations and the operations provided by
the history objects.

Given that a proceed condition c is not satisfied currently, there must be some
exists(h, t, p) or ~exists(h, t, p) expressions not satisfied currently. Given
that c is not restarted and hence likely to become satisfied eventually, at least one of
these expressions is likely to become satisfied eventually. Suppose an exists(h,
t, p) expression is likely to become satisfied eventually. Following our definition of
when exists(h, t, p) is likely, we know that there is a transition tr in h such that
either

1. tr does not match t because tr is tentative, or

2. p(tr) is not satisfied currently but is likely to be satisfied eventually.

Given our rules for adding transitions to the triggering set, tr will be in the set if the
first case is true. If the second case is true, induction can be used to argue that the
program analysis of p(tr) will lead to the addition of some transitions in the
triggering set. A similar argument can be used when a ~exists(h, t, p)
expression is not satisfied currently but likely to become satisfied eventually.

Footnote 24: More accurately, the commitment of tr is in s, since aborting tr would
not make c become likely.

In addition to guaranteeing that an operation will be eventually rescheduled if it is
delayed, there is also a performance issue that unnecessary restarts should be
avoided. This is achieved with the first requirement for the well-formedness of
proceed conditions. By requiring a proceed condition to be satisfiable given the
conditions 1.a, 1.b, and 1.c, we prevent an application from specifying a proceed
condition which is unlikely to become satisfied when in fact an operation is likely to
be able to proceed eventually.

For systems that use a dynamic or time-range concurrency control algorithm, rules
similar to those above can be used to determine whether to restart or delay an
operation, and, if the operation is delayed, when it is rescheduled. Correctness in the
sense that a delayed operation is eventually rescheduled is not difficult to achieve as
long as every delayed operation is rescheduled occasionally. The complexity of an
implementation is in determining which set of events should trigger rescheduling and
whether restart, delay, or some particular way of shrinking a time range should be
employed. It is debatable whether a programmer or a language system
implementation can make better decisions. For example, we have discussed that the
relative merits of restarts and delays depend on their expected costs. Having a
language implementation calculate these costs avoids cluttering a program with
optimizations. However, one may argue that a programmer has a better knowledge
of these costs.

5.4 Commit Protocols 

When a distributed computation commits or aborts, the sites that participated in the
computation have to agree on its outcome. At any time during the process of
reaching an agreement, site crashes or communication failures can occur. Once a
computation is committed, each site should make sure that the computation would
appear to have executed despite site crashes and communication failures. The sites
that participated in an action should also be informed of the action's outcome as
soon as possible, so that other actions will not be delayed. The protocol followed by
the sites to reach agreement is called a commit protocol.

Section 5.4.1 reviews the two-phase commit protocol [17]. Section 5.4.2 describes
an alternative, the one-phase commit protocol, and compares the two. We argue that
the one-phase commit protocol is more suitable in our environment. In the
description of these protocols, we assume that call and return messages are used to
invoke processing on remote sites and to return the results of that processing.

5.4.1 Two-Phase Commit Protocol

The most common commit protocols used by distributed systems are two-phase
commit protocols. In a two-phase commit protocol, one of the sites plays the role of a
coordinator and the other sites become subordinates. We assume that the site that
initiates the top-level action plays the coordinator role, and the other sites that have
participated in the computation are subordinates.

At the end of the computation, if commitment is desired, the coordinator will send
prepare messages to the subordinates and wait for their replies. In a system with
nested actions, only the subordinates with non-aborted sub-actions need to receive
these prepare messages. At the subordinates, a yes vote is returned if commitment is
desired. A no vote is returned otherwise. Before a yes vote is returned, the
subordinates can decide to abort the computation unilaterally. In those cases, a no
vote can be returned when prepare messages arrive.

At the coordinator, if all the votes are yes votes, the computation can be committed
by writing the decision to stable memory atomically. Afterwards,
commit_computation messages will be sent to the subordinates. If any of the votes
returned is a no vote, or the coordinator has given up waiting for all the votes to
return, abort_computation messages can be sent to the subordinates that had sent
yes votes. Abort_computation messages can also be sent anytime during the
execution of the computation. A parent action can also send abort_action messages
to abort sub-actions before the end of a computation.

Commit_computation messages and abort_computation messages are mutually
exclusive. A computation should never send both types of messages. Through the
commit_computation and abort_computation messages, the subordinates will learn
that the computation is finalized. The sending of prepare and vote messages is the
first phase, and the sending of commit/abort_computation messages the second.
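The coordinator's decision at the end of the first phase can be sketched as a small Python function. This is a minimal sketch, not the protocol implementation: the function name `decide`, the representation of votes as a dictionary, and the use of `None` for a vote the coordinator gave up waiting for are all assumptions made for illustration, and the atomic write of the decision to stable memory is omitted.

```python
from typing import Dict, List, Optional, Tuple


def decide(votes: Dict[str, Optional[str]]) -> Tuple[str, List[str]]:
    """Return the message type to send and the subordinates that receive it.

    votes maps each subordinate to "yes", "no", or None (no vote arrived
    before the coordinator gave up waiting).
    """
    if all(v == "yes" for v in votes.values()):
        # Every subordinate voted yes: commit and inform all of them.
        return "commit_computation", sorted(votes)
    # Otherwise abort; only subordinates that voted yes are still waiting
    # for a decision and need an abort_computation message.
    return "abort_computation", sorted(s for s, v in votes.items() if v == "yes")
```

Because the function returns exactly one message type, a computation in this sketch can never send both commit_computation and abort_computation messages.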

When sending messages to the subordinates, either the coordinator can send to
each subordinate directly, or the messages can be relayed by other subordinates. A
convenient strategy is to have the site of a parent action relay the messages to its
sub-actions [37]. The first messages are sent by the site that executes the top-level
action, the coordinator of the computation. The strategy is convenient because each
parent action knows the names of its sub-actions, where they are executed, and
whether they should be aborted or committed. However, having the coordinator
send the messages directly avoids any delay in relaying. To do so, each action
should include the names of its sub-actions, where they are executed, and whether
they should be aborted or committed when it returns to its parent. In this way, the
top-level action will collect all the necessary information to send the messages
directly.

5.4.2 One-Phase Commit Protocol 

An alternative to the 2-phase commit protocol is a 1-phase commit protocol. In the
1-phase commit protocol, no prepare or vote messages are sent. A site is prepared
to commit when it sends a return message. It stays prepared until notified by the
coordinator to commit or abort. The 1-phase commit protocol takes one less
round-trip delay to finish. In a system with long communication delays, this is an
important savings. In a simple 2-site distributed computation using a 2-phase commit
protocol, the coordinator and the subordinate are informed of the outcome of the
computation after 2 and 2.5 round-trip delays respectively. With a 1-phase commit
protocol, the delays are reduced to 1 and 1.5 round-trips respectively.
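The round-trip figures quoted above can be reproduced with a short calculation. The sketch below assumes that the delay is counted from the moment the coordinator sends the call message, that each message takes half a round trip, and that computation and stable-memory time are negligible; the function name `outcome_delays` is hypothetical.

```python
ONE_WAY = 0.5  # one message hop, measured in round-trip delays


def outcome_delays(two_phase: bool):
    """Return (coordinator_delay, subordinate_delay) for a 2-site computation."""
    elapsed = 0.0
    elapsed += 2 * ONE_WAY      # call message to the subordinate and its return
    if two_phase:
        elapsed += 2 * ONE_WAY  # prepare message and the vote that answers it
    coordinator = elapsed             # the coordinator now knows the outcome
    subordinate = elapsed + ONE_WAY   # the decision message reaches the subordinate
    return coordinator, subordinate


print(outcome_delays(two_phase=True))   # 2-phase commit protocol
print(outcome_delays(two_phase=False))  # 1-phase commit protocol
```

The difference between the two calls is exactly one round trip, which is the prepare and vote exchange that the 1-phase protocol omits.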

One of the advantages of the 2-phase commit protocol over the 1-phase commit
protocol is that a subordinate retains the privilege to abort a computation unilaterally
until it has responded yes to a prepare message. Presumably, by aborting a
tentative computation, a site can recover the resources held by that computation.

It is not clear whether this window of vulnerability, during which a subordinate has to
wait for a decision from its coordinator, is in fact shorter in a 2-phase commit
protocol than in a 1-phase protocol. In a 2-phase commit protocol, the length of the
window is at least the time required for a vote to travel to the coordinator and the
decision to come back to the participant. In addition, assuming that most
computations commit, the coordinator has to wait for all the votes before sending out
the decision in most cases. In a 1-phase commit protocol, the length of the window is
determined by the time required to execute the rest of the computation after a
subordinate has returned plus the time needed for the coordinator to send a
decision. If a site is accessed near the end of a computation and sending messages
to sites accessed in the beginning of the computation from the coordinator leads to
long delays, then the site accessed near the end of the computation has a shorter
window with a 1-phase commit protocol. On the other hand, the window is probably
longer for sites accessed in the beginning of a computation if the computation
accesses more than two sites serially. In a simple 2-site distributed computation the
window of vulnerability is approximately a round-trip delay in length for both
protocols if we ignore the time the coordinator uses to compute after it has received
a reply from the subordinate. This period of computation should be negligible
compared to the round-trip delay. The same argument can be applied to an n-site
distributed computation in which the coordinator invokes the n-1 participants in
parallel.

In a 2-phase commit protocol, by delaying the preparation of an action until the
coordinator is ready to commit, there is a possibility that several actions'
preparations can be piggybacked in a single write to stable memory. In a 1-phase
protocol, a sub-action that executes in the same site as some of its ancestors can
delay its preparation until the oldest ancestor returns because a site crash before its
preparation would also abort the ancestor. Otherwise, it has to be prepared before it
returns. The 2-phase commit protocol is more efficient if accessing stable memory is
an expensive operation.

A compromise between the 2-phase and 1-phase commit protocols is to leave a
choice in the protocol. When a subordinate returns, it can set a flag in the return
message to indicate whether it has prepared. If it has not, the coordinator has to
send a prepare message and wait for a yes vote from that subordinate before the
coordinator can commit. Meanwhile, the subordinate can piggyback its preparation
with a later stable memory access; afterwards, as an optimization, it can send a yes
vote to "catch up" with its return. In other words, the preparation can become an
asynchronous process as long as it is performed before the computation is
committed. In Chapter 7 we will discuss the use of checkpoints, of which an early
preparation is a special case, to increase the resilience of a computation.






There are other commit protocols proposed in the literature. Skeen proposed a
3-phase non-blocking commit protocol in [52]. In addition to the extra delay, the
assumptions about the communication network in his protocol are incompatible with
our model. We believe that the 1-phase commit protocol is more appropriate in a
system with long computations because of the reduced delay with one less phase of
messages.

5.5 Summary 

This chapter discussed how the programming interface described in Chapter 4 can
be supported. In particular, we showed that it is possible to mask the concurrency
control algorithm used in a system. We have described how history operations, such
as pjsub or dprior, and the retry statement can be implemented in different
concurrency control algorithms. We have also proposed two novel concurrency
control algorithms which minimize the likelihood of costly conflicts given that special
conditions are met. We have described commit protocols briefly and described a
1-phase protocol which has a shorter delay between an action returning and its being
finalized. A compromise between a 1-phase protocol and a 2-phase protocol using
an asynchronous preparation allows the cost of accessing stable storage to be
reduced.






Chapter Six 
Power of Atomicity 



In this chapter we compare our atomicity definition with other correctness definitions
in which atomicity is abandoned. Atomicity is used in this thesis to model
computations because it is easy to understand and reason about. We have also
shown that the concurrency of a system can be increased by using semantics in an
implementation. In particular, by incorporating the functionality of an application into
the atomicity definition, our approach allows a trade-off between functionality and
concurrency. However, if there were other correctness definitions which permitted
more concurrency, the importance of concurrency might outweigh the simplicity of
atomicity, especially in a system with long computations. In this chapter we will show
that our atomicity definition permits as much concurrency as some non-atomic
correctness definitions. On this basis, we will claim that our model of correct
behavior is preferable, since in comparison it is equally powerful and easier to
understand.

The class of correctness definitions that we use to compare against our atomicity
definition is one in which the application explicitly defines pairs of transitions that
"conflict." These definitions insure that computations that involve conflicting
transitions are executed in the same order at all objects. A representative of this
class of correctness definitions can be found in [50]. A slightly different but similar
correctness definition can be found in [38]. We will describe a correctness definition
which is slightly more general than the one in [50]. We will call the definition we are
going to describe the consistency definition.

As we have described earlier, the consistency definition insures that computations
executing conflicting transitions are executed in the same order at all sites. For
example, suppose computation a executes two transitions a1 and a2 and
computation b executes two transitions b1 and b2. Furthermore, suppose a1
conflicts with b1 and a2 conflicts with b2. The consistency definition requires that
either a1 precedes b1 and a2 precedes b2, or b1 precedes a1 and b2 precedes
a2. More precisely, the consistency definition can be defined with a graph acyclicity
requirement. The nodes in the graph are computations. Two computations are
linked by an edge if they execute a pair of conflicting transitions at an object. The
direction of the edge is determined by the order of execution of the transitions. A
history of transitions is said to be consistent if the graph is acyclic. A system that
only generates consistent histories is called a consistent system.
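The graph acyclicity requirement can be checked mechanically. The Python sketch below is one possible illustration, under the assumption that a history is given as the local execution order at each object and that conflicts are given as unordered pairs of transition names; the function name `is_consistent` and the data layout are hypothetical. Acyclicity is tested by repeatedly removing nodes with no incoming edges.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Event = Tuple[str, str]  # (computation, transition), in local execution order


def is_consistent(local_orders: Dict[str, List[Event]],
                  conflicts: Set[FrozenSet[str]]) -> bool:
    # Build the graph: nodes are computations; an edge c1 -> c2 means that c1
    # executed a transition before a conflicting transition of c2 at some object.
    edges: Dict[str, Set[str]] = {}
    for order in local_orders.values():
        for (c1, t1), (c2, t2) in combinations(order, 2):
            edges.setdefault(c1, set())
            edges.setdefault(c2, set())
            if c1 != c2 and frozenset((t1, t2)) in conflicts:
                edges[c1].add(c2)

    # The graph is acyclic exactly when every node can be removed in
    # topological order.
    indegree = {node: 0 for node in edges}
    for node in edges:
        for successor in edges[node]:
            indegree[successor] += 1
    ready = [node for node, degree in indegree.items() if degree == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for successor in edges[node]:
            indegree[successor] -= 1
            if indegree[successor] == 0:
                ready.append(successor)
    return removed == len(edges)
```

For the example in the text, placing a1 before b1 at one object and b2 before a2 at another produces edges in both directions between a and b, so the history is rejected; if the application declares no conflicts, the same interleaving is accepted.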

An equivalent way of stating the same requirement is to require that there exists a
total order among the computations in the system: if two computations a and b
execute a pair of conflicting transitions at an object with a's transition executed
before b's, then a is ordered before b in the total order. Notice that this total order is
different from a serialization order in an atomicity definition, since only conflicting
pairs of transitions are required to be ordered in this total order. Non-conflicting
pairs of transitions can be ordered in different orders in different objects. There may
be more than one such total order.

An example may help in the understanding of the consistency definition. Consider a
banking account with deposit.x.okay, withdraw.y.okay, withdraw.y.insuf, and
read.balance.z transitions. If the application does not define read.balance.z to be
conflicting with deposit.x.okay or withdraw.y.okay transitions, then a transfer
between two accounts, composed of a withdrawal and a deposit, can interleave with
an audit attempting to find the sum of the balance in two accounts with two
read.balance operations. In one of the accounts, the read.balance.z1 transition
may be executed before the withdraw.y.okay transition, whereas in the other
account, the other read.balance.z2 transition may be executed after the
deposit.y.okay transition. In this example, the amount being transferred is counted
twice by the audit. However, we must assume that this behavior is acceptable to the
application, since it does not choose to exclude it by the definition of conflicting
transitions.
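The double counting in this interleaving is easy to reproduce. The following Python sketch replays the four transitions in the order described; the account names, the starting balances, and the function name are illustrative assumptions only.

```python
def audit_with_interleaved_transfer(balance1: int, balance2: int, amount: int):
    """Replay the interleaving from the text and return
    (total reported by the audit, true total after the transfer)."""
    accounts = {"acct1": balance1, "acct2": balance2}
    z1 = accounts["acct1"]        # audit: read.balance.z1, before the withdrawal
    accounts["acct1"] -= amount   # transfer: withdraw.y.okay at the first account
    accounts["acct2"] += amount   # transfer: deposit.y.okay at the second account
    z2 = accounts["acct2"]        # audit: read.balance.z2, after the deposit
    return z1 + z2, sum(accounts.values())


reported, actual = audit_with_interleaved_transfer(100, 100, 30)
print(reported - actual)  # the audit overstates the total by the amount transferred
```

With two accounts holding 100 each and a transfer of 30, the audit in this sketch reports 230 while the accounts actually hold 200 in total.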

This behavior is typical of what real banking systems exhibit in practice. A transfer of
funds between two accounts is done in two separate parts, certainly when the two
accounts belong to two different banks and often when the accounts belong to
different branches of the same bank. In the case of transfer by check, the deposit
occurs first, and the withdrawal occurs only after the check has "cleared." The
clearing of the check involves physical transport of the check and makes the entire
transfer of funds a long computation. During the time the check clears, the money
appears to be in two places, which is a way of saying that read.balance.z does not
conflict with deposit.x.okay or withdraw.y.okay. People have attempted to take
advantage of the inconsistency by investing the double-counted money in various
ingenious ways. The banks have not corrected this problem by imposing atomicity
across the Federal Reserve System; rather, they tolerate the problem to a degree and
control abuses by regulation and law. The builders of banking systems appear to
believe, as a practical matter, that the imposition of a total ordering among all the
computations would produce intolerable loss of concurrency.

The consistency definition may seem more powerful than atomicity because an
application can specify conflicting transitions explicitly. However, we will show that
atomicity is at least as powerful as the consistency definition. In the banking example
above, we can show that by defining the functionality of the read.balance, withdraw,
and deposit operations appropriately, the behavior described above can be modelled
by our atomicity definition. Our proof does not make the transfer of funds into a short
computation, nor does it enable the audit computation to predict whether a check will
clear and to return accurate and up-to-date answers. However, by casting the
uncertainty in the answers returned with an atomicity model and providing the same
level of concurrency as a consistency system, we provide a simpler model to
understand the behavior of an application than the consistency definition. The better
understanding in turn provides a better framework for the users to deal with the
inconsistencies that they might observe. Thus the power of atomicity that we show is
of more than academic interest.

Our proof is by construction. We show that given any system of objects, their 
transitions, and a set of conflicting transitions, we can construct a system with an 
"equivalent" set of objects, an "equivalent" set of transitions, and serial 
specifications for the equivalent objects, such that the set of consistent histories is 
identical to the set of "equivalent" atomic histories. Consequently, the two systems
have the same behavior and concurrency. The equivalence is defined with mappings 
from one system to the other. The mappings can be used to "simulate" one system 
with the other. 

The problem with the "equivalent" atomic system that we construct is that its serial 
specifications are too complicated to maintain our claim that atomicity is easy to 
understand. Hence our proof only shows that atomicity is at least as powerful, but 
not always easier to understand. We show a second result in this chapter. We show 
that for a class of objects atomicity is as powerful and easier to understand. We also 
argue that this class of objects is a large class.

Section 6.1 presents an informal version of our proof that atomicity is at least as
powerful as the consistency definition. Section 6.2 defines atomicity and consistency
with more formal notations and presents a formal version of the same proof. Section
6.3 defines a class of objects called accurate objects and shows that atomicity is as
powerful and easier to understand for accurate objects. Although some objects in a
system may not be accurate, modelling the behavior of the non-accurate objects with
atomicity allows the behavior of the accurate objects to be understood more easily
than with a consistency definition. If we abandon atomicity in the non-accurate
objects, we abandon atomicity in the accurate objects also.

The correctness requirements to handle situations in which failures can happen are
usually not specified clearly in the consistency definitions in the literature. However,
failure atomicity can be incorporated into these definitions in a straightforward
manner: only committed transitions are considered in determining whether a history
is consistent. We will ignore failure atomicity in our proofs and assume that all
transitions will be committed. The addition of failure atomicity, which is orthogonal to
the serializability and consistency concepts, does not change our results.

6.1 Informal Proof of Power of Atomicity 

Recall that conflicting transitions are required to be executed in a total order of the
computations in a consistent system. We will call this total order a consistent order.
Similar to an atomic system, a concurrency control algorithm is needed to determine
a global consistent order followed by every object in the consistent system. Also, just
as in an implementation of an atomic system, conflicts can be created when, for
example, there is insufficient knowledge of the consistent order. Using the
terminology of the conflict model developed in this thesis, a conflict is created by a
new transition when there are other transitions that have the following properties:

1. these transitions are conflicting with respect to the new transition, and

2. they are potentially ordered after the new transition according to the
global consistent order.

If no conflicts are created, an object can proceed to determine the result to be
returned. In a consistent system, the result is computed based on the order in which
transitions are executed in an object, which we will call the local execution order.

The core of our proof is to construct an equivalent atomic system in which conflicts
are created in the same situations and the same results are returned when there are
no conflicts. Since conflicts are created in the same situations, the atomic system
has the same level of concurrency as the consistent system. Since the same results
are returned, the atomic system has the same "behavior." More rigorously, since the
conflict conditions and the validity of results in an atomic system are determined by
the serial specifications, we need to construct serial specifications that guarantee
that a history in the atomic system is atomic if and only if the equivalent history in the
consistent system is consistent.






Before describing these serial specifications, we will describe how the atomic objects
in the equivalent atomic system can be implemented. Presumably, one can argue
that the same implementation that implements the objects in the consistent system
can be used to implement the atomic objects. However, we will describe an
implementation using the mechanisms that we described in Chapter 4, which may
help in understanding the equivalence between the atomic system and the consistent
system.

Just as in the implementations in Chapter 4, each transition executed at an atomic
object is recorded in a history object. When a new operation is invoked, the history
object is queried to determine whether there are previously invoked conflicting
transitions that can potentially be serialized after the new transition. If there are, a
conflict is created and has to be resolved. If no conflict is created, the
implementation has to determine a valid result to return. Since results are computed
according to the local execution order in a consistent system, the results in the
atomic system should be computed in the same way. In a practical implementation,
the transitions in the history object should be merged according to the local
execution order, so that the snapshot/projection object can be used to determine the
result efficiently. The local execution order has to be encoded in the transitions so
that they can be merged accordingly.

We will now describe the serial specifications for the objects in the atomic system
that create the same conflicts as the objects in the consistent system. Suppose that
in a consistent system a transition t1 is executed before another transition t2 in an
object o and t1 and t2 are a pair of conflicting transitions. From our definitions, t1
must be ordered before t2 in any consistent order. If we can make sure that, for their
equivalent transitions t1' and t2', t1' must be ordered before t2' in any serialization
order, then a serialization order exists only if a consistent order exists. Also, if the
ordering of any such pairs of t1' and t2' is the only requirement on a serialization
order, then a serialization order exists if a consistent order exists. To make sure that
t1' is ordered before t2' in a serialization order, we can require the collection of
conflicting transitions that are executed before t2', such as t1', to be serialized
before t2'. To express this requirement in the serial specifications, we can encode
this collection of transitions in t2' and compare this collection with the collection of
transitions that are serialized before t2'. Since a serialization order exists if and only
if a consistent order exists, conflicts are created under the same situations.

An additional requirement on the serial specifications of the equivalent atomic
objects is needed. In addition to guaranteeing that conflicts are created under the
same situation, we must also require that the results returned in the equivalent
atomic system are those returned by the consistent system. In a consistent system,
the validity of a result is determined by specifications like the serial specifications.
For example, if a withdraw operation returns okay, then there must be enough
deposits executed before the withdraw operation to cover the withdrawal. Since the
serialization order, though identical with the consistent order, may not be the same
as the local execution order, we cannot use the serialization order to compute the
results. In other words, the validity of the results in the atomic system should be
determined with the local execution order instead of the serialization order.
Consequently, an additional requirement on the serial specifications is that each
transition should encode the sequence of previously invoked transitions in the local
execution order and ensure that this transition's result is valid according to that
order.

Since the concurrency levels in the two systems are the same, and the results
returned are identical with the exception that a sequence of previously invoked
transitions have been encoded in the transitions generated in the atomic system, we
claim that the two systems have the same behavior. The same implementations can
be used to implement the two systems. The only difference between the two is the
modelling of the acceptable behavior of the system.






6.2 Formal Proof of Power of Atomicity 

This section presents a more formal version of the argument described in the last 
section. Atomicity and consistency are defined more formally in sections 6.2.1 and 
6.2.2. The formal proof is in section 6.2.3. 

6.2.1 Atomicity 

Some terminology is needed before presenting the definition of an atomic system.
Suppose h is a sequence of events, r is an object, and a is an action. We define h|r
to be the subsequence of h involving r and h|a to be the subsequence of h involving
a but not a's sub-actions. An event in a sequence h is committed if there is a commit
event of the same action identifier in h. We define committed(h) to be the
subsequence of h that involves only invoke and return events that are committed.
Aborted(h) is defined similarly. The sign "$\|$" denotes concatenation of sequences.
We will omit the concatenation signs for sequences whenever it is convenient. For
example, $t_1 t_2 \ldots$ refers to $t_1 \| t_2 \| \ldots$. Also, we will use the "$\in$" sign to refer to an
element being part of a sequence. So for example, we say $t_2 \in t_1 t_2 \ldots$.

A sequence of events h is well-formed if it satisfies the following conditions:

1. Ignoring commit and abort events, the subsequence h|a should have
alternating invoke and return events, starting with an invoke event, and
with each pair involving the same object.

2. committed(h) and aborted(h) do not have any common events.

3. If a commit event of an action a appears in h, then h|a consists of an
alternating sequence of invoke and return events (starting with an invoke
event and ending with a return event) and some commit events at
different objects.

A well-formed sequence of events is called a history.

We define a function Serial which takes a history and a linearization of the actions in
that history as inputs, and returns the history rearranged according to the
linearization. More formally, if an action a or an ancestor of a is prior to another
action b or an ancestor of b in the linearization L, then h|a precedes h|b in Serial(h,
L). The order between events of a and events of a's sub-actions is preserved in
Serial(h, L).

We define Globally.Atomic.Objects as the set of globally atomic objects in the
system. A history h is globally atomic iff:

$$\exists L \;\; \forall r_i \in \mathit{Globally.Atomic.Objects}: \;\; N_i(I_i, \mathit{Serial}(\mathit{committed}(h|r_i), L)) \neq X$$

where L is a linearization for actions in h,
$N_i$ is the state transition function of the serial specification of $r_i$, and
$I_i$ is the initial state of the state machine.

A system is atomic if it generates only atomic histories.



To simplify our proofs, we will ignore nested actions. Hence, instead of a
linearization, only a total ordering of the computations in a history is needed. We will
also limit a history to be a sequence of transitions and commit and abort events. In
other words, an invoke event must be followed immediately by the corresponding
return event. Transitions from different computations can still be interleaved. The
limitation is imposed to simplify the mapping between histories in an atomic system
and a consistent system. The simplification does not make any difference to our
results as the positions of the invoke events in a history are irrelevant.

Without failure atomicity and nested actions, the set of atomic histories can be
redefined as follows: a history h is globally atomic iff

$$\exists L \;\forall r_i \in \mathit{Globally\_Atomic\_Objects}: \; N_i(I_i, \mathit{Serial}(h|r_i, L)) \neq \bot$$

where L is a total ordering for computations in h.

Notice that since we assume that every transition is committed, no commit or abort
events need to appear in h, which becomes a sequence of transitions.
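
The simplified definition can be illustrated with a brute-force check that tries every total ordering of the computations. This is only a sketch under stated assumptions: a history is represented as a list of `(computation, object, operation)` triples, `None` stands for the undefined state, and `account_step` is a hypothetical serial specification invented for the example.

```python
from itertools import permutations


def serial(h, order):
    """Serial(h, L): group transitions by computation, in the order L.

    sorted() is stable, so the transitions of one computation keep
    their relative order.
    """
    rank = {comp: i for i, comp in enumerate(order)}
    return sorted(h, key=lambda t: rank[t[0]])


def run(step, initial, transitions):
    """Apply the state transition function; None plays the role of 'undefined'."""
    state = initial
    for _comp, _obj, op in transitions:
        state = step(state, op)
        if state is None:
            return None
    return state


def is_globally_atomic(h, specs):
    """True if some total order L makes every object's serial history valid."""
    comps = sorted({t[0] for t in h})
    for order in permutations(comps):
        if all(
            run(step, init, serial([t for t in h if t[1] == r], order)) is not None
            for r, (step, init) in specs.items()
        ):
            return True
    return False


def account_step(balance, op):
    """Hypothetical serial specification of an account (assumed for illustration)."""
    kind, amount = op
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount if balance >= amount else None
    raise ValueError(kind)
```

With an initial balance of 0, the history in which computation A withdraws 5 before computation B deposits 5 is atomic, because the ordering B, A is a valid serialization. The search is exponential in the number of computations, so it is useful only for reasoning about small examples.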

6.2.2 Consistency 

To distinguish the objects in a consistent system and an atomic system, we use the
symbol $r_{Ci}$ to refer to an object in a consistent system, where C in the subscript Ci
refers to the set of conflicting transition pairs:

$$C = \{\, (t_1, t_2) \mid t_1 \text{ and } t_2 \text{ are conflicting transitions of some object } r_{Ci} \,\}$$

We assume that there is some mechanism for an application to define C. 

The semantics of each object $r_{Ci}$ is defined with a state machine similar to those
used to define serial specifications of atomic objects. The state machine used to
define the semantics of $r_{Ci}$ has four components: $N_{Ci}$, $S_{Ci}$, $I_{Ci}$, and $T_{Ci}$,
corresponding to $N_i$, $S_i$, $I_i$, and $T_i$ in a serial specification.

A history $h_C$ is consistent iff:

$$G_{h_C} \text{ is acyclic and } \forall r_{Ci}: \; N_{Ci}(I_{Ci}, h_C|r_{Ci}) \neq \bot$$

where

$$G_{h_C} = \{\, (Comp_{Ca}, Comp_{Cb}) \in \mathit{Computations}(h_C) \times \mathit{Computations}(h_C) \text{ such that } h_C = \ldots t_{Ca} \ldots t_{Cb} \ldots,\ (t_{Ca}, t_{Cb}) \in C,\ t_{Ca} \in Comp_{Ca},\ t_{Cb} \in Comp_{Cb},\ Comp_{Ca} \neq Comp_{Cb} \,\}$$

Computations($h_C$) = set of computations that appear in $h_C$

In the definition above, $G_{h_C}$ is a graph of edges between the computations that
appear in $h_C$. An edge exists between two distinct computations $Comp_{Ca}$ and
$Comp_{Cb}$ iff they have executed a pair of conflicting transitions $t_{Ca}$ and $t_{Cb}$. To make
sure that conflicting transitions executed by different computations are not
interleaved, $G_{h_C}$ must be acyclic. Furthermore, the transitions must be valid
according to specifications of the objects in the system. Notice that there is no
global total ordering governing the order in which computations appear in $h_C|r_{Ci}$.
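
The graph condition of the consistency definition can be sketched as follows. The representation is an assumption made for illustration: a history is a list of `(computation, object, transition_name)` triples, the conflict set is a set of ordered name pairs applied per object, and the validity condition on each object's local history is left out.

```python
def conflict_graph(h, conflicts):
    """Edges (Comp_a, Comp_b): Comp_a executed t_a before Comp_b executed t_b
    at the same object, and (t_a, t_b) is a conflicting pair."""
    edges = set()
    for i, (comp_a, obj_a, t_a) in enumerate(h):
        for comp_b, obj_b, t_b in h[i + 1:]:
            if comp_a != comp_b and obj_a == obj_b and (t_a, t_b) in conflicts:
                edges.add((comp_a, comp_b))
    return edges


def is_acyclic(edges):
    """Repeatedly remove nodes without incoming edges (Kahn-style check)."""
    nodes = {n for edge in edges for n in edge}
    edges = set(edges)
    while nodes:
        free = {n for n in nodes if not any(dst == n for _src, dst in edges)}
        if not free:
            return False  # every remaining node lies on or behind a cycle
        nodes -= free
        edges = {(src, dst) for src, dst in edges if src not in free}
    return True
```

For example, if computation A reads object x before B writes it, and B writes object y before A reads it, the graph has edges in both directions and the history is rejected, even though each object's local order is unremarkable on its own.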

6.2.3 Proof 

Suppose a consistent system is defined with a set of conflicting transitions C, the
objects $r_{Ci}$, and the specifications of these objects, which are in turn defined by $N_{Ci}$,
$S_{Ci}$, $I_{Ci}$, and $T_{Ci}$. Our goal is to construct an equivalent atomic system defined with a
set of equivalent objects $r_i$ and the serial specifications of these objects, which are
defined by $N_i$, $S_i$, $I_i$, and $T_i$. A 1-1 mapping M will be defined to map histories in the
atomic system to those in the consistent system. The set of atomic histories in the
atomic system should map to the set of consistent histories in the consistent system.

We will first show how the serial specifications in the atomic system are defined.
Then we prove lemma 1 which states that if a history $h_C$ is consistent then the history
$M^{-1}(h_C)$ in the atomic system is atomic, and lemma 2 which states the reverse: if a
history h is atomic then the history M(h) in the consistent system is consistent.

Construction of Serial Specifications in Equivalent Atomic System 

In our informal version of the proof, we argued that for each transition $t_{Ca}$ that
executes at the object $r_{Ci}$ in the consistent system, it is necessary to encode the
entire history of transitions that execute at $r_{Ci}$ before $t_{Ca}$ in $t_a$. The set of equivalent
transitions $T_i$ at the equivalent object $r_i$ can be defined as:

$$T_i = T_{Ci} \times T_{Ci}^*$$

where $T_{Ci}$ is the set of possible transitions in the object $r_{Ci}$,

$T_{Ci}^*$ is the set of all possible sequences of transitions in $T_{Ci}$.

The first component of a transition $t_a$ in $T_i$ corresponds to the equivalent transition
$t_{Ca}$ in $T_{Ci}$. The second component encodes the sequence of transitions that were
executed at $r_{Ci}$ previous to $t_{Ca}$. To make sure that the second component does
encode such a sequence and the histories in the atomic system have a 1-1 mapping
with those in the consistent system, we constrain the set of histories H in the atomic
system to be coherent:

1. If $h = \ldots t_a \ldots \in H$

and $t_a = (t_{Ca}, ts_{Ca})$, $t_{Ca} \in T_{Ci}$, $ts_{Ca} \in T_{Ci}^*$

and $\forall t_x = (t_{Cx}, ts_{Cx})$ such that $h = \ldots t_x \ldots t_a \ldots$: $t_{Cx} \notin T_{Ci}$

then $ts_{Ca} = \langle\rangle$

(i.e., if $t_a$ is the first transition that belongs to $r_i$ in h, then the second component of $t_a$
should be an empty sequence.)

2. If $h = \ldots t_a \ldots t_b \ldots \in H$

and $t_a = (t_{Ca}, ts_{Ca})$, $t_b = (t_{Cb}, ts_{Cb})$, $t_{Ca}, t_{Cb} \in T_{Ci}$, $ts_{Ca}, ts_{Cb} \in T_{Ci}^*$

and $\forall t_d = (t_{Cd}, ts_{Cd})$ such that $h = \ldots t_a \ldots t_d \ldots t_b \ldots$: $t_{Cd} \notin T_{Ci}$

then $ts_{Cb} = ts_{Ca} \| t_{Ca}$

(i.e., if $t_a$ and $t_b$ are consecutive transitions that belong to the same object, then the
second component of $t_b$ should be the concatenation of the second and first
components of $t_a$.)

The coherence requirement is an additional requirement that we need to impose on
the atomic histories because it cannot be expressed with the serial specifications.
Since the coherence requirement deals with histories rather than serial histories, it
exposes the concurrency in a system. When serial specifications are used to reason
about the behavior of a system, concurrency can be ignored. This is not true for the
coherence requirement. In section 3.4.3, we have talked about a similar requirement
that requires exposing the concurrency in a system. In that section, we described a
lower_bound_balance operation on an account object. In order to guarantee that an
implementation does not return trivial results, such as zero, we require that a result
has to be one of the possible results given the many possibilities of serialization
orders and operation outcomes. Since this guarantee is a separate requirement from
the serial specification, we cannot assume any non-trivial results when we reason
about the behavior of lower_bound_balance using only the serial specifications.

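Given that histories in H are coherent, there is an obvious 1-1 mapping M and its
reverse $M^{-1}$ between H and $H_C$, the set of possible histories in the consistent system:

$$M((t_{Ca}, ts_{Ca}) \| (t_{Cb}, ts_{Cb}) \| \ldots) = t_{Ca} t_{Cb} \ldots$$

$$M^{-1}(t_{Ca} t_{Cb} \ldots) = \begin{cases} (t_{Ca}, \langle\rangle) \| (t_{Cb}, \langle\rangle) \| \ldots & \text{if } t_{Ca} \in T_{Ci},\ t_{Cb} \in T_{Cj},\ i \neq j \\ (t_{Ca}, \langle\rangle) \| (t_{Cb}, t_{Ca}) \| \ldots & \text{if } t_{Ca} \in T_{Ci},\ t_{Cb} \in T_{Ci} \end{cases}$$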

We will reuse the symbols M and $M^{-1}$ to stand for the obvious mappings between the
computations in h and $h_C$, or the mappings between $G_{h_C}$ and a corresponding
graph in Computations(h) $\times$ Computations(h). For notational convenience, we
assume:

$$\forall h \in H,\ \forall t_a \in h: \; t_a = (t_{Ca}, ts_{Ca})$$

Note that if $t_{Ca}$ appears in $h_C$ at the object $r_{Ci}$, then $ts_{Ca}$ is the concatenation of all
the transitions that execute before $t_{Ca}$ at $r_{Ci}$. In other words, $ts_{Ca} \| t_{Ca}$ is an initial
subsequence of $h_C|r_{Ci}$.
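
The mappings and the coherence requirement can be sketched in Python. The representation is an assumption for illustration: a history in the consistent system is a list of `(object, transition)` pairs, and each transition in the atomic system carries a tuple of the transitions previously executed at the same object. The names `m`, `m_inverse`, and `is_coherent` are hypothetical.

```python
def m_inverse(hc):
    """Map a consistent-system history to the equivalent atomic-system history.

    Each transition is paired with the sequence of transitions that
    executed before it at the same object.
    """
    executed = {}  # object -> tuple of transitions executed there so far
    h = []
    for obj, t in hc:
        prior = executed.get(obj, ())
        h.append((obj, (t, prior)))
        executed[obj] = prior + (t,)
    return h


def m(h):
    """Map an atomic-system history back by dropping the encoded prefix."""
    return [(obj, t) for obj, (t, _prior) in h]


def is_coherent(h):
    """A history is coherent exactly when its encoded prefixes are the ones
    that m_inverse would have produced."""
    return m_inverse(m(h)) == h
```

Under this representation, conditions 1 and 2 of the coherence requirement reduce to a round trip: the first transition at each object must carry an empty prefix, and each later one must carry its predecessor's prefix extended by the predecessor itself.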

We now proceed to finish our definition of the state machine of $r_i$ by defining $S_i$ (the
set of states), $I_i$ (the initial state), and $N_i$ (the state transition function).

Let $S_i = T_{Ci}^*$

$I_i = \langle\rangle$

$\mathit{critical}(t_a, ts_{Cd}) = \{\, t_{Cx} \in ts_{Cd} \mid (t_{Cx}, t_{Ca}) \in C \,\}$

$N_i(ts_{Ca}, t_b) = ts_{Ca} \| t_{Cb}$ iff $\mathit{critical}(t_b, ts_{Ca}) \subseteq \mathit{critical}(t_b, ts_{Cb})$

and $N_{Ci}(I_{Ci}, ts_{Cb} \| t_{Cb}) \neq \bot$

In the definition of $N_i$ above, two conditions have to be satisfied in order for
$N_i(ts_{Ca}, t_b)$ to be defined. The first condition requires that $\mathit{critical}(t_b, ts_{Ca})$ is a
subset of $\mathit{critical}(t_b, ts_{Cb})$. In other words, all the conflicting transitions that
execute before $t_{Cb}$ are serialized before $t_b$. The second condition requires that
$N_{Ci}(I_{Ci}, ts_{Cb} \| t_{Cb})$ is defined. In other words, the transition $t_b$ must be valid
according to the local execution order at $r_i$, since this is required in the consistency
definition.

The following two lemmas will show that a history $h_C$ is consistent if and only if the
equivalent history $M^{-1}(h_C)$ is atomic.






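Lemma 1: if $h_C$ is a consistent history then $M^{-1}(h_C)$ is an atomic history

Proof:

Suppose $h_C$ is a consistent history, let $M^{-1}(h_C) = h$

Let L be a total order of all the computations in Computations(h)
such that it is consistent with $M^{-1}(G_{h_C})$

Suppose $\mathit{Serial}(h|r_i, L) = t_a t_b \ldots t_{k-1} t_k$

We will use induction on k to show that $N_i(I_i, \mathit{Serial}(h|r_i, L)) \neq \bot$

Basic Step:

From the definition of critical, we know: $\mathit{critical}(t_a, \langle\rangle) = \emptyset$

$\Rightarrow \mathit{critical}(t_a, \langle\rangle) \subseteq \mathit{critical}(t_a, ts_{Ca})$

Also, since $ts_{Ca} \| t_{Ca}$ is an initial subsequence of $h_C|r_{Ci}$
and $N_{Ci}(I_{Ci}, h_C|r_{Ci}) \neq \bot$

$\Rightarrow N_{Ci}(I_{Ci}, ts_{Ca} \| t_{Ca}) \neq \bot$

Hence $N_i(\langle\rangle, t_a) = t_{Ca} \neq \bot$

Induction Step:

Suppose $N_i(I_i, t_a t_b \ldots t_{k-1}) \neq \bot$

From the definition of $N_i$, we know: $N_i(I_i, t_a t_b \ldots t_{k-1}) = t_{Ca} t_{Cb} \ldots t_{C(k-1)}$

Suppose $t_{Cx} \in \mathit{critical}(t_k, t_{Ca} t_{Cb} \ldots t_{C(k-1)})$

$\Rightarrow t_{Cx} \in t_{Ca} t_{Cb} \ldots t_{C(k-1)}$ and $(t_{Cx}, t_{Ck}) \in C$

$\Rightarrow (Comp_{Cx}, Comp_{Ck}) \in G_{h_C}$ and $(t_{Cx}, t_{Ck}) \in C$

$\Rightarrow h_C = \ldots t_{Cx} \ldots t_{Ck} \ldots$ and $(t_{Cx}, t_{Ck}) \in C$

$\Rightarrow t_{Cx} \in ts_{Ck}$ and $(t_{Cx}, t_{Ck}) \in C$

$\Rightarrow t_{Cx} \in \mathit{critical}(t_k, ts_{Ck})$

Hence $\mathit{critical}(t_k, t_{Ca} t_{Cb} \ldots t_{C(k-1)}) \subseteq \mathit{critical}(t_k, ts_{Ck})$

Also, since $ts_{Ck} \| t_{Ck}$ is an initial subsequence of $h_C|r_{Ci}$
and $N_{Ci}(I_{Ci}, h_C|r_{Ci}) \neq \bot$:

$\Rightarrow N_i(t_{Ca} t_{Cb} \ldots t_{C(k-1)}, t_k) \neq \bot$

$\Rightarrow N_i(I_i, t_a t_b \ldots t_k) \neq \bot$

Hence h is an atomic history
QED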






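Lemma 2: if h is an atomic history then M(h) is a consistent history

Proof:

Let $h_C = M(h)$

Suppose $h_C$ is not a consistent history

$\Rightarrow \exists r_{Ci}\ \exists t_{Ca} \in h_C|r_{Ci}: \; N_{Ci}(I_{Ci}, ts_{Ca} \| t_{Ca}) = \bot$

or a cycle of transitions exists:

$(t_{Cm2}, t_{Ca1}), (t_{Ca2}, t_{Cb1}), \ldots, (t_{Cl2}, t_{Cm1}) \in C$

s.t. $h_C = \ldots t_{Cm2} \ldots t_{Ca1} \ldots$, $h_C = \ldots t_{Ca2} \ldots t_{Cb1} \ldots$, ..., $h_C = \ldots t_{Cl2} \ldots t_{Cm1} \ldots$

and $t_{Ca1}, t_{Ca2} \in Comp_{Ca}$; $t_{Cb1}, t_{Cb2} \in Comp_{Cb}$; ...; $t_{Cm1}, t_{Cm2} \in Comp_{Cm}$

Suppose $\exists r_{Ci}\ \exists t_{Ca} \in h_C|r_{Ci}: \; N_{Ci}(I_{Ci}, ts_{Ca} \| t_{Ca}) = \bot$

$\Rightarrow \exists r_i\ \exists t_a \in h|r_i: \; N_{Ci}(I_{Ci}, ts_{Ca} \| t_{Ca}) = \bot$

$\Rightarrow \exists r_i\ \exists t_a \in h|r_i: \; N_i(s, t_a) = \bot$ for all possible $s \in S_i$

$\Rightarrow$ h is not an atomic history, contradiction

Suppose the cycle of transitions exists.

Since h is an atomic history

$\Rightarrow \exists$ a total order L of Computations(h) s.t. $\forall r_i: N_i(I_i, \mathit{Serial}(h|r_i, L)) \neq \bot$

$\Rightarrow \exists (Comp_f, Comp_e) \in L$ s.t. $(t_{Ce2}, t_{Cf1}) \in C$,
$t_{Ce2} \in Comp_{Ce}$, $t_{Cf1} \in Comp_{Cf}$, and $h_C = \ldots t_{Ce2} \ldots t_{Cf1} \ldots$

$\Rightarrow t_{f1} \in \mathit{Prefix}$, where $\mathit{Serial}(h|r_i, L) = \mathit{Prefix} \| t_{e2} \| \mathit{Suffix}$

$\Rightarrow t_{Cf1} \in \mathit{critical}(t_{e2}, N_i(I_i, \mathit{Prefix}))$

Since $t_{Cf1} \notin ts_{Ce2}$

$\Rightarrow t_{Cf1} \notin \mathit{critical}(t_{e2}, ts_{Ce2})$

$\Rightarrow N_i(I_i, \mathit{Prefix} \| t_{e2}) = \bot$

$\Rightarrow$ h is not an atomic history, contradiction

Hence $h_C$ is a consistent history
QED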

From Lemmas 1 and 2, we know that given any set of objects $r_{Ci}$, their specifications
which are defined with $N_{Ci}$, $S_{Ci}$, $I_{Ci}$, and $T_{Ci}$, and a set of conflicting pairs of
transitions C, we can construct an equivalent set of objects $r_i$, their serial
specifications which are defined with $N_i$, $S_i$, $I_i$, and $T_i$, so that:

$h_C$ is a consistent history iff $M^{-1}(h_C)$ is an atomic history

6.3 Objects with Simple Serial Specifications 

With lemma 1 and lemma 2, we have shown that atomicity is at least as powerful as
the consistency definition. However, the serial specifications that we have
constructed above are impractical in that they require encoding the entire previous
history in a transition. The more complicated a serial specification becomes, the
more difficult it is to understand. Thus, although atomicity is as powerful, it is not
always easier to understand. In this section, we will argue that the serial
specifications can be simplified in many cases and still have the same behavior and
concurrency. In particular, we will show that for a particular class of objects in a
consistent system, their specifications can be used as the serial specifications for
their equivalent atomic objects. No complicated artificial serial specifications have to
be constructed. Since the specifications in the two systems are just as easy to
understand and the concept of atomicity is easier to understand than the concept of
the consistency definition, we will claim that our approach is preferable.

We will first define this class of objects, which we call accurate objects. Then we
prove a lemma which shows that the set of consistent histories is a subset of the
equivalent atomic histories when accurate objects reuse the specifications of their
counterparts as serial specifications. Finally we argue that the class of accurate
objects is a large class.






6.3.1 Accurate Objects 

Ignoring the requirement that a consistent order must exist, the only difference
between a consistent system and an atomic system is that the former can execute its
transitions in a local execution order, whereas the latter has to make its transitions
appear to be executed in a global serialization order. In general, this results in less
concurrency for the atomic system. Informally, because a pair of transitions may not
"commute", an implementation of the atomic system may create conflicts in the
process of making sure that the pair appears to execute in the serialization order. A
pair of transitions $t_{Ca}$ and $t_{Cb}$ commutes if:

$$\forall h_C, h_C' \in T_{Ci}^*: \; N_{Ci}(I_{Ci}, h_C \| t_{Ca} \| t_{Cb} \| h_C') \neq \bot \;\Rightarrow\; N_{Ci}(I_{Ci}, h_C \| t_{Cb} \| t_{Ca} \| h_C') \neq \bot$$

Consider an object $r_{Ci}$ in a consistent system with the property that all
non-commutative pairs of transitions are conflicting. Suppose we construct an equivalent
object $r_i$ in an atomic system using the specification of $r_{Ci}$ as its serial specification.
Suppose a transition t2 is executed after a transition t1. There are two possible
scenarios: either t1 and t2 commute or they do not. In the first scenario, since t1
and t2 commute, no conflicts will be created in either system. Regardless of the
serialization order or the consistent order, the transitions t1 and t2 will be valid. In
the second scenario, t1 and t2 do not commute. In a consistent system, because t1
and t2 are also conflicting, t2 can only proceed if the implementation is sure that t2
is ordered after t1 in the consistent order. Reusing the consistent order as the
serialization order, we can achieve the same concurrency in the atomic system: t2
can only proceed if the implementation is sure that t2 is ordered after t1 in the
serialization order.

This property of $r_{Ci}$ can be defined more formally as follows:

$\forall t_{Ca}, t_{Cb} \in T_{Ci}$: if $N_{Ci}(I_{Ci}, h_C \| t_{Ca} \| t_{Cb} \| h_C') \neq \bot$

and $N_{Ci}(I_{Ci}, h_C \| t_{Cb} \| t_{Ca} \| h_C') = \bot$ for some $h_C, h_C' \in T_{Ci}^*$

then $(t_{Ca}, t_{Cb}) \in C$

$r_{Ci}$ has the property that whenever a pair of transitions does not commute, then it is
conflicting and belongs in C. We call $r_{Ci}$ an accurate object.

Notice that commutativity depends on the definition of $N_{Ci}$. For example, suppose
the specification of a bank account object is defined with the state machine in figure
6-1. This specification is similar to the one we defined in figure 3-1 except that
insufficient_funds may be returned even when the balance is more than enough to
cover the withdrawal. The motivation of this non-determinism is to allow a
pessimistic reply to be returned immediately instead of being delayed by tentative
updates.

In the state machine in figure 6-1, the only pairs of transitions that do not commute
are (read_balance_x, deposit_y_okay), (deposit_y_okay, read_balance_x),
(read_balance_x, withdraw_y_okay), (withdraw_y_okay, read_balance_x), and
(deposit_x_okay, withdraw_y_okay). The transition pair (withdraw_y_okay,
deposit_x_okay) commutes since the extra deposit does not invalidate the
withdrawal. Also, the transition withdraw_x_insuf commutes with all other
transitions, even though "normally" we would expect it not to commute with
deposit_y_okay and withdraw_y_okay transitions.

S_Ci: real numbers

T_Ci: <deposit(x), r_Ci, a><okay, r_Ci, a> = deposit_x_okay
      <withdraw(x), r_Ci, a><okay, r_Ci, a> = withdraw_x_okay
      <withdraw(x), r_Ci, a><insufficient_funds, r_Ci, a> = withdraw_x_insuf
      <read_balance(), r_Ci, a><x, r_Ci, a> = read_x
      where a is a computation, x is a positive real number.

N_Ci(s, deposit_x_okay) = s + x
N_Ci(s, withdraw_x_okay) = s - x if s >= x
N_Ci(s, withdraw_x_insuf) = s
N_Ci(s, read_x) = s if s = x

Figure 6-1: Specification of a Bank Account Object in a Consistent System
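
The state machine of figure 6-1 is small enough to explore mechanically. The sketch below encodes it in Python and searches short histories for a witness that a pair of transitions does not commute. Several details are assumptions rather than part of the figure: the initial balance is taken to be 0, transitions are `(kind, amount)` pairs, `None` stands for an undefined state, and the search is bounded, so a result of `None` shows only that no short witness exists, not that the pair commutes.

```python
from itertools import product


def step(s, t):
    """State transition function of the account; None means 'undefined'."""
    kind, x = t
    if kind == "deposit_okay":
        return s + x
    if kind == "withdraw_okay":
        return s - x if s >= x else None
    if kind == "withdraw_insuf":
        return s  # pessimistic reply, allowed in any state
    if kind == "read":
        return s if s == x else None
    raise ValueError(kind)


def valid(seq, initial=0):
    """True if the whole sequence is defined from the (assumed) initial state."""
    s = initial
    for t in seq:
        s = step(s, t)
        if s is None:
            return False
    return True


def find_witness(ta, tb, alphabet, max_len=2):
    """Search for (h, h') such that h ta tb h' is valid but h tb ta h' is not."""
    for n, m in product(range(max_len + 1), repeat=2):
        for h in product(alphabet, repeat=n):
            for h2 in product(alphabet, repeat=m):
                if valid(h + (ta, tb) + h2) and not valid(h + (tb, ta) + h2):
                    return h, h2
    return None
```

With `d5 = ("deposit_okay", 5)` and `w5 = ("withdraw_okay", 5)`, the call `find_witness(d5, w5, [d5, w5])` finds a witness with empty surrounding histories, while the reversed pair yields none within the bound, which matches the asymmetry described in the text.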






6.3.2 Specifications of Accurate Objects Can Be Reused 

We will show that if $r_{Ci}$ is accurate and the serial specification of the equivalent
object $r_i$ is defined as:

$I_i = I_{Ci}$, $S_i = S_{Ci}$, $T_i = T_{Ci}$, $N_i = N_{Ci}$

then the set of atomic histories includes the set of consistent histories. An
equivalence in behavior and concurrency is achieved without defining artificial serial
specifications for $r_{Ci}$. Rather, the same specification used in the consistent system is
used.

The current consistency definition precludes the two sets of histories from being
equal. However, the stronger requirement of equality is not necessary as histories
that are atomic but not consistent are indistinguishable from the other atomic ones in
the sense that all the atomic histories can be generated by some serial execution.
Equality can be proved if we use the following more general consistency definition:

$h_C$ is a consistent history iff

$$G_{h_C} \text{ is acyclic and } \forall r_{Ci}\ \exists L_i: \; N_{Ci}(I_{Ci}, \mathit{Serial}(h_C|r_{Ci}, L_i)) \neq \bot$$

where $L_i$ is a total order of the transitions in $h_C|r_{Ci}$

$$G_{h_C} = \{\, (Comp_{Ca}, Comp_{Cb}) \in \mathit{Computations}(h_C) \times \mathit{Computations}(h_C) \text{ such that } (t_{Ca}, t_{Cb}) \in L_i \text{ for some } i,\ (t_{Ca}, t_{Cb}) \in C,\ t_{Ca} \in Comp_{Ca},\ t_{Cb} \in Comp_{Cb},\ Comp_{Ca} \neq Comp_{Cb} \,\}$$

Using the new definition does not change our previous results except that $N_i$ in
section 6.2.3 has to be redefined. In the following proof, we will use the old
definition.

Lemma 3: if $h_C$ is a consistent history then $M^{-1}(h_C)$ is an atomic history

(The mappings M and $M^{-1}$ can be extended in the obvious way. For example,
suppose $t_{Cb}$ is a transition of an accurate object whereas $t_{Ca}$ and $t_{Cc}$ are not.

$$M((t_{Ca}, ts_{Ca}) \| t_{Cb} \| (t_{Cc}, ts_{Cc}) \| \ldots) = t_{Ca} t_{Cb} t_{Cc} \ldots$$

$$M^{-1}(t_{Ca} t_{Cb} t_{Cc} \ldots) = (t_{Ca}, \langle\rangle) \| t_{Cb} \| (t_{Cc}, t_{Ca}) \| \ldots \quad \text{if } t_{Ca} \in T_{Ci},\ t_{Cc} \in T_{Ci}$$

If all the objects in the system are accurate, then M and $M^{-1}$ become the identity
mapping.)

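Proof:

Let $\mathit{Commutative}_i \subseteq T_{Ci}^* \times T_{Ci}^*$ s.t. $(h_{Ci}, h_{Ci}') \in \mathit{Commutative}_i$ iff

1. $N_{Ci}(I_{Ci}, h_{Ci}) \neq \bot$, and

2. $N_{Ci}(I_{Ci}, h_{Ci}') \neq \bot$, and

3. $h_{Ci} = h \| t_1 \| t_2 \| h'$, $h_{Ci}' = h \| t_2 \| t_1 \| h'$ where $t_1, t_2 \in T_{Ci}$, or $h_{Ci} = h_{Ci}'$

Let $\mathit{Reachable}_i$ be the transitive closure of $\mathit{Commutative}_i$

Suppose $h_C$ is a consistent history, let $M^{-1}(h_C) = h$

Let L be a total order of all the computations in Computations(h)
such that it is consistent with $M^{-1}(G_{h_C})$

For non-accurate objects, we can show that $N_i(I_i, \mathit{Serial}(h|r_i, L)) \neq \bot$ as
before.

For accurate objects $r_{Ci}$, let $h_C|r_{Ci} = h|r_i = t_1 t_2 \ldots t_{m-1} t_m$

In the rest of the proof we will use induction on k to show that:

$(\mathit{Serial}(t_1 \ldots t_k, L) \| t_{k+1} \ldots t_m,\ h_C|r_{Ci}) \in \mathit{Reachable}_i \quad \forall k = 1, 2, \ldots, m$

In particular, since it is true for k = m:

$\Rightarrow (\mathit{Serial}(t_1 \ldots t_m, L),\ h_C|r_{Ci}) \in \mathit{Reachable}_i$

$\Rightarrow N_{Ci}(I_{Ci}, \mathit{Serial}(t_1 \ldots t_m, L)) \neq \bot$

$\Rightarrow N_i(I_i, \mathit{Serial}(h|r_i, L)) \neq \bot$

$\Rightarrow M^{-1}(h_C)$ is an atomic history

Basic Step: k = 1

It is obvious that $(t_1 \ldots t_m,\ h_C|r_{Ci}) \in \mathit{Reachable}_i$ as $t_1 \ldots t_m = h_C|r_{Ci}$.

Induction Step:

Suppose $(\mathit{Serial}(t_1 \ldots t_k, L) \| t_{k+1} \ldots t_m,\ h_C|r_{Ci}) \in \mathit{Reachable}_i$

Let $\mathit{Serial}(t_1 \ldots t_k, L) = u_1 \ldots u_k$

Let $\mathit{Serial}(t_1 \ldots t_{k+1}, L) = u_1 \ldots u_j\, t_{k+1}\, u_{j+1} \ldots u_k$

From the definition of L, we know: $(t_{k+1}, u_{j+1}), \ldots, (t_{k+1}, u_k) \notin C$

$\Rightarrow N_{Ci}(I_{Ci}, u_1 \ldots u_{k-1}\, t_{k+1}\, u_k\, t_{k+2} \ldots t_m) \neq \bot$ since $r_{Ci}$ is accurate

$\Rightarrow (u_1 \ldots u_{k-1}\, t_{k+1}\, u_k\, t_{k+2} \ldots t_m,\ u_1 \ldots u_{k-1}\, u_k\, t_{k+1}\, t_{k+2} \ldots t_m) \in \mathit{Reachable}_i$

$\Rightarrow (u_1 \ldots u_j\, t_{k+1}\, u_{j+1} \ldots u_k\, t_{k+2} \ldots t_m,\ u_1 \ldots u_{k-1}\, u_k\, t_{k+1}\, t_{k+2} \ldots t_m) \in \mathit{Reachable}_i$

$\Rightarrow (\mathit{Serial}(t_1 \ldots t_{k+1}, L) \| t_{k+2} \ldots t_m,\ h_C|r_{Ci}) \in \mathit{Reachable}_i$

QED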



6.3.3 There Are Many Accurate Objects 

There are three possible kinds of pairs of non-commutative transitions:

1. mutator - observer

2. mutator - mutator

3. observer - mutator

Notice that case 3 is different from case 1 because a mutator transition and an
observer transition can be defined as conflicting if they execute in one order but
non-conflicting in the other order. We will argue that in most cases, an application
would define the three kinds of non-commutative transitions as conflicting. Hence
most objects are accurate.

Mutator - Observer 

The main reason for a mutator-observer pair to be conflicting is that there is no
concurrency gained by making them non-conflicting. Typically, when a
mutator-observer pair does not commute, the validity of the result returned by the observer
also depends on the outcome of the mutator. Consequently, because the observer
has to be delayed in any case, making them conflicting does not cause any loss in
concurrency.

The bank account object with its $N_{Ci}$ defined in figure 6-1 can be used to illustrate
this argument. Suppose the account object has the following pairs of transitions in
C:

(read_balance_x, deposit_y_okay), (read_balance_x, withdraw_y_okay),
(deposit_y_okay, read_balance_x), (withdraw_y_okay, read_balance_x)

These conflicting transition pairs in C prevent audit computations from interleaving
with fund transfer computations. However, because (deposit_x_okay,
withdraw_y_okay) is not in C, the account object is not accurate. We will show that
no concurrency is gained by making the account object not accurate.

Consider an implementation of a consistent system in which an algorithm similar to a
dynamic concurrency control algorithm is used to guarantee that a consistent order
exists. An incoming transition t is delayed until any previously executed transition t*
is finalized if (t*, t) $\in$ C. Also, to guarantee that $N_{Ci}(I_{Ci}, h_C|r_{Ci}) \neq \bot$, a
withdraw_x_okay transition is generated only when previous committed deposits in
$h_C$ are sufficient to cover the unaborted withdrawals. A withdraw_x_insuf transition
can be generated anytime without creating any conflicts.
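
The two rules of this implementation can be sketched as follows. This is an illustrative reading, not the thesis's algorithm: the function names, the log layout, the status strings, and the interpretation of "finalized" as "committed or aborted" are all assumptions.

```python
def must_delay(incoming, executed, conflicts):
    """True if some previously executed, not yet finalized transition t*
    satisfies (t*, incoming) in C.

    executed: list of (transition_name, finalized) pairs.
    """
    return any(
        (name, incoming) in conflicts and not finalized
        for name, finalized in executed
    )


def withdraw_reply(amount, log):
    """Decide the reply of a withdrawal.

    log: list of (kind, amount, status) with kind in {"deposit", "withdraw"}
    and status in {"committed", "tentative", "aborted"}.
    Only committed deposits count, and every unaborted withdrawal is charged.
    """
    committed_deposits = sum(
        x for kind, x, status in log if kind == "deposit" and status == "committed"
    )
    unaborted_withdrawals = sum(
        x for kind, x, status in log if kind == "withdraw" and status != "aborted"
    )
    if committed_deposits - unaborted_withdrawals >= amount:
        return "okay"
    return "insufficient_funds"  # pessimistic reply, never creates a conflict
```

Note that the pessimistic reply is returned immediately instead of waiting for tentative deposits to be resolved, which is the source of the extra concurrency discussed in the text.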

The same implementation can be used if we define the account object as atomic with
$N_{Ci}$ as its serial specification and use a dynamic concurrency control algorithm. This
is true despite the fact that successful withdraw transitions and deposit transitions
are not commutative. Two factors contribute to this equivalence. First, the
implementation has the property that the conflicting transition pairs in a history
generated by the implementation are ordered by their commit timestamps. Second, if
a withdraw transition depends on some previous deposit transitions, it must be
committed only after they are committed. Consequently, if we compare the actual
execution order and the serialization order, a successful withdraw transition is
ordered after a deposit transition in both orders if the withdrawal depends on the
deposit.

To present our arguments more rigorously, consider a sequence of transitions
$s = u_1 \ldots u_j\, d_x\, w_y\, v_1 \ldots v_n$ such that $N_{Ci}(I_{Ci}, s) \neq \bot$,

where $d_x$ is a deposit_x_okay transition,

$w_y$ is a withdraw_y_okay transition.

Consider the sequence with the two transitions $d_x$ and $w_y$ reversed:
$s' = u_1 \ldots u_j\, w_y\, d_x\, v_1 \ldots v_n$.

Since $N_{Ci}(I_{Ci}, s) \neq \bot$,

if also $N_{Ci}(I_{Ci}, u_1 \ldots u_j\, w_y) \neq \bot$,

then $N_{Ci}(I_{Ci}, s') \neq \bot$, since the $v_i$'s are not affected by the order of the
withdraw and deposit transitions.

If the system is implemented with the dynamic algorithm that we described above, we
know that the order in which the computations commit, L, is consistent with $G_{h_C}$.
Obviously, either $w_y$ is committed after $d_x$ or $d_x$ is committed after $w_y$. If the former
is true, we know that $w_y$ is serialized after $d_x$ according to L and we would not have
to "switch" $w_y$ in front of $d_x$ during the induction step in lemma 3. In other words,
we do not have to worry about the validity of s'.






If $d_x$ is serialized after $w_y$, it must be uncommitted when $w_y$ is executed.
Furthermore, due to the property of the concurrency control algorithm, all the
committed deposits at the time $w_y$ is executed must be represented in $u_1 \ldots u_j$.
Assuming that the bank object cannot predict whether uncommitted deposits will
commit, it implies that:

$$N_{Ci}(I_{Ci}, u_1 \ldots u_j\, w_y) \neq \bot$$

Consequently, we know that for any consistent $h_C$ generated by the implementation
that we described above, $(\mathit{Serial}(h_C|r_{Ci}, L),\ h_C|r_{Ci}) \in \mathit{Reachable}_i$ despite the fact
the account object $r_{Ci}$ is not accurate. Making (deposit_x_okay, withdraw_y_okay)
non-conflicting does not gain any concurrency.

Mutator - Mutator 

Before describing the reasons why a mutator-mutator pair should be conflicting, we
should observe that there are many mutator-mutator pairs that commute. For
example, all the mutators in the bank account example commute with one another
because increments and decrements commute. Similarly, in an airline reservation
system, increments and decrements of seat counts commute with one another. The
concurrency problem that we encounter in these applications is usually due to
conflicts between observers and mutators.

Nevertheless, there are also many examples in which two mutators do not commute.
One of them involves an "overwrite" transition, such as resetting the value of a
counter, which does not commute with other mutator transitions. In a calendar
application, changing the meeting place of a meeting appointment does not
commute with another transition that changes the meeting place of the same
appointment. In a FIFO-queue, the order in which items are enqueued determines
the order in which items are dequeued. Two enqueue transitions do not commute.
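
The FIFO-queue case is easy to check concretely. The sketch below is a toy illustration (the `run` helper and the operation encoding are assumptions): it replays a list of operations on a queue and records what the dequeues return, so that swapping two enqueues visibly changes the observed results.

```python
from collections import deque


def run(ops):
    """Replay (operation, item) pairs on a FIFO queue; return dequeued items."""
    queue = deque()
    results = []
    for op, item in ops:
        if op == "enqueue":
            queue.append(item)
        elif op == "dequeue":
            results.append(queue.popleft())
        else:
            raise ValueError(op)
    return results


original = [("enqueue", "a"), ("enqueue", "b"), ("dequeue", None), ("dequeue", None)]
swapped = [("enqueue", "b"), ("enqueue", "a"), ("dequeue", None), ("dequeue", None)]
```

Here `run(original)` yields `["a", "b"]` while `run(swapped)` yields `["b", "a"]`: a later dequeue that returned "a" first is valid after one order of the enqueues and invalid after the other, which is what it means for the two enqueue transitions not to commute.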

There are several reasons why these non-commutative transition pairs should be
conflicting. First, making them conflicting is the only means to maintain consistency
within a set of objects. For example, in a replicated object, if a computation that
performs an "overwrite" operation at each replica can interleave with other mutator
computations, the state that results at each replica is no longer consistent. This is
probably not acceptable to the application. Similarly, if two computations that
change the meeting place of a meeting appointment are executed concurrently, the
desirable behavior is to serialize the mutators at each participant calendar in the
same order, so that at least all the participants would go to the same place for the
meeting. Making the transitions that change the meeting place conflicting is the only
way to guarantee such a behavior. The question of why there are two such
computations initiated concurrently in the first place should be left for arbitration at a
higher level.

Second, making two mutators non-conflicting does not improve concurrency in many
cases. In the implementations that we have described in previous chapters, the
validity of the results of two mutator transitions does not depend on the outcome of
other transitions or the serialization order. For example, both inserting an item x and
removing x from a set object return okay in any case. It is only when there are other
observer transitions whose validity depends on the serialization order or outcomes of
these mutator transitions that conflicts may be created. For example, in the
implementation of a set object in figure 4-3 on page 100, the only condition under
which a conflict is created by a delete operation is when the delete(x) operation
may be serialized between an insert(x) operation and a member(x) operation that had
returned true. If the implementation uses a dynamic concurrency control algorithm,
the only situation in which such a condition can be met is when the insert(x) operation is
committed and the member(x) operation is tentative. In an implementation of a
consistent system, whether a conflict would also be created under such a condition
depends on whether member(x) and delete(x) are conflicting, which we will discuss
below.

Observer - Mutator 






In an atomic system, a conflict condition depends on the functionality of the
application. In particular, whether a conflict is created by a mutator that executes
after an observer depends on the functionality of the mutator and observer. For
example, in the bank account example described in figure 6-1, no conflicts are
created by any mutator that executes after the transition withdraw_x_insuf because
insufficient_funds does not guarantee that the balance is less than the amount to be
withdrawn.

Similarly, the relaxed semantics of insufficient_funds can be used to increase
concurrency in a consistent system. A pessimistic answer can be returned by
withdraw if there are tentative mutators. Given that insufficient_funds has a relaxed
functionality, defining withdraw_x_insuf and deposit_y_okay as conflicting does
not increase concurrency over an atomic system. In other words, defining an
observer-mutator pair to be conflicting may not increase concurrency because the
functionality of the observer may have been relaxed to avoid conflict between a
mutator-observer pair of transitions.

In summary, since defining each of the three possible types of non-commutative
transition pairs as non-conflicting is unlikely to increase concurrency, defining them
as conflicting does not decrease concurrency either. Consequently, the set of
accurate objects is likely to be a large set.

6.4 Conclusion 

In this chapter we have shown that atomicity is at least as powerful as a consistency
definition that is similar to some other correctness definitions proposed in the
literature. By allowing serial specifications to be defined by an application, a
programmer can construct an atomic system equivalent to a consistent system in
terms of its concurrency and behavior. However, the serial specifications of the
equivalent atomic system are too complicated to sustain our claim that our atomicity
definition is easier to understand than the consistency definition. We showed that for
a class of accurate objects the specification used in a consistent system can be
used as the serial specification in the equivalent atomic system. Since the
specifications in the two systems are as easy to understand and the concept of
serializability is easier to understand than the concept of consistency, we claim that
atomicity is at least as powerful and easier to understand in the case of accurate
objects. We argued that the class of accurate objects is a large class because it is
unlikely to have non-conflicting non-commutative transition pairs.

This chapter finishes our discussion of concurrency. In the next chapter we will turn
our attention to resilience problems in a system with long computations.






Chapter Seven 
Resilience 



When the execution of a computation spans a long period of time, the probability of
its encountering some transient failure increases. After a failure, a computation may
have lost its program state (e.g. local variables) before the failure and be unable to
resume its execution. Unless precautions are taken to guard against these transient
failures, a computation becomes more and more unlikely to be completed
successfully when its length increases. Other than site crashes, transient failures
also include deadlocks and invalid assumptions in concurrency control algorithms.

Two kinds of resilience problems are dealt with in this chapter. The first kind of
resilience problem is concerned with limiting the amount of lost work when a failure
occurs. The use of nested actions is a partial solution: aborting a sub-action in
progress does not undo the sibling actions or the parent action. However, using
sub-actions alone is not sufficient. If a sub-action is aborted after it had finished and
the abort is not initiated by the parent action, the parent action has to be aborted
also. Since the execution of the sub-action may be non-deterministic and have
affected the subsequent execution of the parent action, a mere re-execution of the
sub-action is inadequate. Storing the modifications of the sub-action in stable
memory only helps occasionally, as aborts may be caused by deadlocks and invalid
assumptions in concurrency control algorithms as well as by site crashes.

Conversely, when an action is aborted, all its sub-actions have to be aborted also.
Significant delay can be added to the response time when these sub-actions are
executed at remote sites. Re-executing the aborted action but not the sub-actions is
not acceptable in general. The execution of the aborted action can be
non-deterministic such that a different set of sub-actions may be created in the
re-execution.

The second kind of resilience problem is related to communication. In a
communication network where partitions are frequent, a message may never reach
the destination site if resending from the origin site is the only measure to mask
partitions. Consider the communication path between two sites to consist of
switches linked by direct communication links. If the receiver or one of these
switches or links is non-operational, a partition is created. Even though individual
partitions disappear over time, and the sender site can resend the message
repeatedly, the system may be partitioned in such a manner that the sender and
receiver sites never establish a connection along which all the components would be
operational simultaneously (figure 7-1). A special case of this situation is when the
sender and receiver sites are connected to the communication network at
non-overlapping periods of time.




Site1 --- Switch1 --- Switch2 --- Switch3 --- Site2

Figure 7-1: Partitions that Prevent Communication
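The situation in figure 7-1 can be made concrete with a small simulation. The sketch
below is illustrative only: the up/down schedules, the function names, and the
assumption that a link is either up or down for a whole time step are ours and are not
part of the system described in this chapter. It contrasts end-to-end delivery, which
needs every link to be up in the same step, with hop-by-hop relaying, in which each
node buffers the message until its next link comes up.

```python
# Hypothetical model: link i connects node i to node i+1.
# schedule[i][t] is True when link i is operational at time step t.

def end_to_end_delivery_time(schedule):
    """First step at which every link is up at once, or None if never."""
    steps = len(schedule[0])
    for t in range(steps):
        if all(link[t] for link in schedule):
            return t
    return None


def relayed_delivery_time(schedule):
    """Step at which a buffered message crosses the last link, or None.

    The message waits at each node until the next link is up and crosses
    at most one link per time step.
    """
    steps = len(schedule[0])
    hop = 0  # index of the next link to cross
    for t in range(steps):
        if schedule[hop][t]:
            hop += 1
            if hop == len(schedule):
                return t
    return None


# Three hypothetical links that are never up simultaneously.
schedule = [
    [True,  False, False, True,  False, False],
    [False, True,  False, False, True,  False],
    [False, False, True,  False, False, True],
]
print(end_to_end_delivery_time(schedule))  # None
print(relayed_delivery_time(schedule))     # 2
```

With this schedule no amount of resending from the origin succeeds, while a message
that is buffered along the path arrives at step 2.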






With most current communication protocol implementations, an end-to-end
connection from sender to receiver is assumed. While switches may resend to 
recover from a transient failure, they currently do not have the capability to buffer 
messages for an extended period of time, so that the ultimate resending 
responsibility falls back on the sender. If partitions develop, these assumptions 
prevent successful communication. 

In section 7.1, we describe a checkpointing mechanism which allows a program
interrupted by failures to restart itself at the last checkpoint. A "program" can be
equated with a procedure in a resource manager. Checkpointing has been
suggested in the literature [41, 53] to increase the resilience of a computation; our
goal is to work out a checkpointing mechanism compatible with the implementation
paradigm described in this thesis. In addition, because of our assumption that
communication delays can be significantly long, we will discuss how to minimize
aborting remote sub-actions by coordinating the checkpoints with remote
invocations. Another difference between our work and other work on checkpointing
mechanisms relates to the amount of information stored in a checkpoint. In order to
avoid checkpointing every piece of information accessible to a program, we will
describe how the program can specify a subset of the state to be preserved across
checkpoints.

In section 7.2, we describe how messages can be relayed through message transfer
agents (MTA's). The protocol between two MTA's or an MTA and its client is simple,
minimizing the state that needs to be kept on both sides. MTA's are capable of
buffering messages as well as storing messages in stable memory so that messages
are not lost with site crashes while waiting for partitions to disappear.

7.1 Checkpoints 

This section describes how a program can checkpoint its state during execution. At
a checkpoint, all the updates to the shared objects accessed or created by this
program should be stored in stable memory. These shared objects include all the
objects accessible from the permanent state of the resource manager. In addition,
any objects local to this program (e.g., local variables) must have their updates
remembered in a known location in stable memory. Since it may be too expensive to
copy all the accessible local state into stable memory, we will describe how the
application program can specify a subset of the local state. Only objects in this
subset are accessible after the checkpoint.

Due to our decision that only a subset of the state accessible to a program is
preserved by a checkpoint, and because a procedure is a more convenient unit than
a process to specify the subset, we will equate a program with a procedure.
Obviously, checkpointing only the state of a program is not sufficient. To guard
against site crashes, all the ancestor programs on the call stack at the same site must
also be checkpointed. It may also be appropriate to extend the checkpointing
beyond this site.

Our approach may provide less availability than a system in which the checkpointed
state is replicated in another site with relatively independent failure characteristics.
To determine the appropriate trade-off, availability should be evaluated against the
cost and complexity of replication. Complexity can be reduced at the cost of special
hardware support (e.g., dual-ported disks).

In the remainder of this section we describe our checkpoint mechanism in greater 
detail. We will describe the actions taken at checkpoint time and failure occurrences. 

7.1.1 Checkpoint Time 

Our discussion of the actions taken at checkpoint time will start with a brief
description of the local state that needs to be stored by a checkpointing program.
Storing the local state accessed by a program is not enough to guarantee resilience,
however. We will also discuss how the objects accessed by previously invoked
sub-programs can be stored in stable memory, and how checkpoints can be propagated
to ancestor programs.

7.1.1.1 Checkpointing a Program 

At a checkpoint, a program can specify a collection of local variables in a checkpoint
record. Together with the permanent state of the resource manager, a checkpoint
record constitutes the accessible state after the checkpoint.

Since abstract atomic objects of an application are eventually implemented using
globally atomic objects or locally atomic objects supported by the language system,
storing the accessible state requires storing these system-level objects into stable
memory. For concreteness, we will assume that the system-level objects are
implemented using read/write locks and storing the objects into stable memory
requires writing log information that contains new values of modified objects into
stable memory [44]. Other algorithms are possible [48, 17].

When the log records that contain the values of modified objects are written out, they
are associated with the corresponding checkpoint so that a consistent set of values
can be restored after a failure. The order in which log records are stored can be
used to determine the order of different checkpoints taken by a computation. The
creation and preparation of a sub-action can be regarded as special checkpoints and
ordered with other regular checkpoints in the log. When a restart is needed later, the
ordering in the log can be used to determine the latest checkpoint to roll back to. To
model checkpoints taken by parallel actions, an acyclic directed graph instead of a
total order can be used to model the order.
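A minimal sketch of this use of the log follows, assuming that an append-only list
stands in for stable memory and that every record carries the identifier of the
checkpoint it belongs to. The record layout and the names LogRecord,
latest_checkpoint, and restore are illustrative assumptions; the sketch covers only
the totally ordered case, not checkpoints taken by parallel actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    checkpoint_id: int    # checkpoint this record belongs to
    kind: str             # "value", or "checkpoint" marking a completed checkpoint
    obj: str = ""         # name of the modified object (for "value" records)
    value: object = None  # new value of the modified object


def latest_checkpoint(log):
    """Return the id of the last completed checkpoint in log order, or None."""
    for record in reversed(log):
        if record.kind == "checkpoint":
            return record.checkpoint_id
    return None


def restore(log, checkpoint_id):
    """Rebuild object values as of the given checkpoint.

    Records are replayed in log order up to the marker of the chosen
    checkpoint; records written after that marker are ignored.
    """
    state = {}
    for record in log:
        if record.kind == "value":
            state[record.obj] = record.value
        elif record.checkpoint_id == checkpoint_id:
            return state
    raise ValueError("checkpoint not found in log")


log = [
    LogRecord(1, "value", "x", 10),
    LogRecord(1, "checkpoint"),
    LogRecord(2, "value", "x", 20),
    LogRecord(2, "value", "y", 5),
    LogRecord(2, "checkpoint"),
    LogRecord(3, "value", "x", 99),  # failure before checkpoint 3 completed
]
cp = latest_checkpoint(log)
print(cp, restore(log, cp))  # 2 {'x': 20, 'y': 5}
```

The value written for the incomplete checkpoint 3 is not restored, so the values
recovered form the consistent set associated with checkpoint 2.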

When a checkpoint is taken, an object checkpointed may be locked or a previously
acquired lock may have been released. If the object is still locked, this can be
indicated in the log record so that the lock can be retained when the object is
restored. If the lock is released, it is because the object is a locally atomic object and
the local computation that acquired the original lock had committed. If any changes
made by the locally atomic computation had been written out to stable memory, no
further work needs to be done. Otherwise, any changes made by the locally atomic
computation, including the decision to commit the locally atomic computation, can
be flushed to stable memory.

One complication remains. If a locally atomic object is checkpointed while a lock is
held and the lock is subsequently released, it may not be possible to roll back to that
checkpoint because some other locally atomic computation could have accessed the
object and possibly committed. One of the solutions is to disallow checkpointing a
locally atomic object when it is locked. This is not a severe restriction because we
expect checkpoints to be taken between, and not during, short locally atomic
computations. Linguistically, a checkpoint can be taken as the end of a locally
atomic computation, which forces locks to be released at the checkpoint. Another
possibility is to discard the checkpoint as if it had never been done when locks are
released later. The decision to discard a checkpoint can be written to stable memory
together with the decision to commit the locally atomic computation and release
locks.

Log records about a checkpoint can be discarded when the action in which the
checkpoint is executed is finalized.¹

Linguistically, in order to enforce the scope of the local variables so that the program
after the checkpoint can only access those objects contained in the checkpoint
record or permanent state, we require the program to continue in a separate
program module after a checkpoint. We call this program module a continuation
procedure, the name of which is stored in stable memory and associated with the
checkpoint. The permanent state is accessible to all program modules in the
resource manager. The checkpoint record can be made accessible to the
continuation procedure as its "arguments." See figure 7-2 for an example.



¹ If the only source of failures is site crashes, a checkpoint can be discarded once the action
executes a later checkpoint or is prepared.



calendar = resource manager is ...
    permanent state is
        a: table[slot]

    make_appointment = procedure(...)
        local1: integer
        ...
        checkpoint(local1, ...)
            continue at cont1
    end make_appointment

    cont1 = procedure(clocal1: integer, ...)
        ... clocal1 ...
        ... a ...
    end cont1

Figure 7-2: A Program Using Checkpoints
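The structure of figure 7-2 can also be sketched in a conventional language. In the
sketch below, which is an illustration and not the notation proposed in this thesis, a
dictionary stands in for stable memory, continuation procedures are registered by
name, and checkpoint and restart are hypothetical helpers. Only the values placed in
the checkpoint record reach the continuation procedure; the permanent state of the
resource manager is not modelled.

```python
import json

stable_memory = {}   # stand-in for stable storage: key -> JSON string
continuations = {}   # continuation name -> procedure


def continuation(proc):
    """Register a procedure so it can be found by name after a restart."""
    continuations[proc.__name__] = proc
    return proc


def checkpoint(key, continue_at, **record):
    """Save the checkpoint record and the continuation's name, then continue."""
    stable_memory[key] = json.dumps({"continue_at": continue_at, "record": record})
    return continuations[continue_at](**record)


def restart(key):
    """After a failure, resume at the continuation named in the checkpoint."""
    saved = json.loads(stable_memory[key])
    return continuations[saved["continue_at"]](**saved["record"])


@continuation
def cont1(clocal1):
    # Only the checkpoint record (clocal1) is accessible here.
    return clocal1 + 1


def make_appointment():
    local1 = 41
    scratch = [1, 2, 3]  # not in the checkpoint record, so not preserved
    return checkpoint("make_appointment", "cont1", clocal1=local1)


print(make_appointment())           # 42
print(restart("make_appointment"))  # 42
```

Because the continuation is a separate procedure, the scoping rule is enforced by the
language itself: scratch is simply not in scope inside cont1.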



7.1.1.2 Propagating a Checkpoint to Previously Invoked Sub-Programs

In addition to the local objects accessed by this program, other objects accessed by
the sub-programs previously invoked by this program should also be stored in stable
memory. Since these sub-programs had already returned, no local variables need to
be stored. Only the objects in the permanent state of the resource managers in
which these sub-programs executed have to be written out to stable memory. If a
sub-program and its parent execute at the same site, a single stable memory access
can be used to write out all the log records. If they execute on different sites, the
parent has to send messages to inform the sub-program of the checkpoint.

To simplify our discussion, we assume that all remote sub-programs are executed in
sub-actions. If these remote sub-actions have already prepared, no extra work is
needed. Otherwise, prepare messages should be sent to the remote sub-actions. If a
no vote is returned by a sub-action, this action has to be rolled back to a checkpoint
taken before the sub-action is created. We will discuss rollbacks in the next section.
It is not necessary for the parent to wait for a remote sub-action to prepare before
proceeding. However, when the parent prepares later, it has to make sure that the
sub-action has also prepared.

7.1.1.3 Two Kinds of Checkpoints

Two kinds of checkpoints are allowed in this proposal. The first kind of checkpoint
is associated with a procedure call. Under our model, the length of a computation is
attributed to communication delays. Consequently, if a program expects a long delay
in the return of a remote procedure call, it should execute a checkpoint immediately
after evaluating any arguments but before the call. If the site in which the caller
resides crashes during the wait, any previous work, such as calling some other
remote procedures, and the ongoing call would not have to be aborted. Executing
the checkpoint before the call minimizes the possibility that the caller will be aborted.
By associating the procedure call with the checkpoint, we guarantee that the
checkpoint will be immediately before the call and the deterministic processing in
between would not invalidate the invoke message.

The second kind of checkpoint is not associated with any procedure calls. These
checkpoints are executed when a program arrives at some "logical breaks." At
these logical breaks, the remaining tasks in the program are relatively independent of
previous tasks. Little or no local state is required to be stored for the continuation
procedure. However, if we assume that a program spends relatively little time
between remote calls, there is less motivation for these checkpoints.

When a checkpoint associated with a procedure call is executed, the arguments and
a unique frame identifier of the callee will be stored along with other information in
stable memory. A frame identifier uniquely identifies a program. We assume that
frame identifiers are unique over the lifetime of a system. Storing the frame identifier
of a callee ensures that a program is aware of its waiting for another program to
return when it is restarted. The continuation procedure will only be invoked when the
procedure call finally returns. A handle can be provided to access the results of the
call in the continuation procedure. The use of frame identifiers will be discussed
further in the next section.

A program can anticipate the delay in calling a remote procedure and execute a
checkpoint at the time of the call. On the other hand, a program can also delay the
checkpoint until it is informed by the system of the difficulty in communicating with
the remote site. We expect the system to convey such difficulties through some
system-defined exceptions. In the discussion below, we assume that an
unavailable exception is raised at a remote call when communication with the
remote site is not possible. It is possible that the invoke message might have been
delivered and the remote call is actually executing.

The alternatives available to a program when an unavailable exception is raised
depend on the exception model. With a resumption model [36], a program can
execute a checkpoint and resume the outstanding call. With a termination
model [29], the outstanding call is abandoned. The resumption model has the
advantage that the call will not be aborted if it had been, or will be, started. The
program also has the choice of abandoning the call and pursuing some other
alternatives, in which case the sub-action associated with the call will be aborted if it
is ever going to be started. After the checkpoint and resumption, the state of the
program is as if the checkpoint had been anticipated.

7.1.1.4 Propagating a Checkpoint to Ancestor Programs

In the discussion above, we have ignored the interaction between a program that
executes a checkpoint and its ancestor programs. In fact, the resilience of the
computation is not much improved if only the current program is checkpointed. In
order to notify the caller of a checkpointing program, executing a checkpoint
statement will also cause a special exception to be raised inside the caller. At the risk
of a slight misnomer, we can reuse the name unavailable for the special exception.
Unless the caller had anticipated the delay by a previous checkpoint, the caller has to
provide a handler for the exception. To handle the exception, the caller can decide
to checkpoint its state and resume the callee. The exception can be avoided if the
callee knows that the caller has a checkpoint associated with the call.

If the caller did not anticipate the checkpoint and decides to checkpoint when it
receives the exception, it will in turn cause an exception to be raised in its own
caller. Thus, checkpoints are propagated along the call chain (see figure 7-3). This
propagating of checkpoints can be thought of as translating volatile stack frames into
a chain of "stable stack frames," each of which consists of the following:

1. a checkpoint record,

2. the frame identifiers of this program and its caller,

3. a continuation procedure,

4. the frame identifier of the callee and the arguments of the call if the
checkpoint is associated with a procedure call.
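As a data structure, a stable stack frame might look like the sketch below. The field
names, the types, and the helper call_chain are assumptions made for illustration;
the text above fixes only which four items a frame contains.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class StableStackFrame:
    checkpoint_record: Dict[str, Any]      # 1. local state preserved by the checkpoint
    frame_id: str                          # 2. frame identifier of this program
    caller_frame_id: Optional[str]         # 2. frame identifier of its caller (None if top-level)
    continuation: str                      # 3. name of the continuation procedure
    callee_frame_id: Optional[str] = None  # 4. only for a checkpoint at a procedure call
    call_arguments: Tuple[Any, ...] = ()   # 4. arguments of that call


def call_chain(frames, frame_id):
    """Follow caller links from one stable frame up to the top-level frame."""
    chain = []
    while frame_id is not None and frame_id in frames:
        chain.append(frame_id)
        frame_id = frames[frame_id].caller_frame_id
    return chain


frames = {
    "a#1": StableStackFrame({}, "a#1", None, "conta", "b#1", ()),
    "b#1": StableStackFrame({"n": 3}, "b#1", "a#1", "contb", "c#1", (3,)),
}
print(call_chain(frames, "b#1"))  # ['b#1', 'a#1']
```

The caller links play the role that return addresses play in a volatile stack, which is
what allows the chain to be walked again after a crash.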

During a checkpoint, storing updated objects and the stable stack frame into stable
memory, notifying the caller, and executing the continuation procedure can all
proceed in parallel. If the caller does not resume this program, the current action can
be aborted asynchronously. The parallelism is needed as the caller may be from a
remote site, creating a long delay in notification. If the caller and callee are at the
same site, their checkpoints can be synchronized in such a manner that the storing
of their states into stable memory can be buffered in a single access to stable
memory. On the other hand, there may be applications that prefer to minimize
the probability of rollbacks before starting the continuation procedure. A
synchronous checkpoint can be provided; the continuation procedure will only be
invoked after the following has happened:

1. the caller has resumed this program,

2. the objects updated by this program and its sub-programs have been
stored in stable memory,

3. the procedure call associated with the checkpoint has returned.



a = procedure(...)
    invoke b
        except when unavailable:
            checkpoint
            resume
            continue at Conta
        end
end a

b = procedure(...)
    invoke c
        except when unavailable:
            checkpoint
            resume
            continue at Contb
        end
end b

[Figure annotations: "resume execution of a", "resume execution of b", "resume
execution of c"; "generate unavailable signal at caller of a", "generate unavailable
signal at caller of b", "generate unavailable signal at caller of c".]

Figure 7-3: Propagating Checkpoints to Ancestor Programs



7.1.1.5 Checkpointing Parallel Sub-Actions

Consider when a checkpointing program is one of the parallel sub-actions invoked by 
a parent action. Like other checkpoints, the program has to supply a continuation 
procedure and a checkpoint record. The creator of these parallel sub-actions is also 
notified so that it can checkpoint if it had not anticipated the delay. In its checkpoint, 
it will remember the sub-actions that have not yet finished. Its continuation 
procedure will be invoked only after all the remaining sub-actions are finished. 

Parallel sub-actions can be used to specify an application time-out. Figure
7-4 describes a scenario in which a parent action creates two parallel sub-actions:
one of them sends out requests to set up a meeting, the other contains a checkpoint
statement and remembers a deadline. The continuation procedure of the timer
sub-action will sleep until the deadline is reached. When the timer sub-action is
awakened, it will abort the sibling action or perform other necessary tasks. If the
sibling action is finished before the deadline, it will abort the timer sub-action and
return. We assume that there are mechanisms to abort sibling actions.

7.1.2 Restart Time

This section describes the process of restoring the state of a program to a
checkpoint. First, the restartable programs have to be identified. This is not a
straightforward operation as checkpoints can be asynchronous at different sites.
Then the states of the sites involved have to be restored to those recorded by the
checkpoints and the programs associated with the checkpoints are restarted. We
will focus on the case where the failure is caused by a site crash. Later we will
describe variations to handle other types of failures.

7.1.2.1 Identifying the Restartable Program 

After a failure, the system should consult the record of the checkpoints. The goal is
to identify the last checkpoint executed by a program whose caller is still expecting
the program to return. If the failure is caused by a site crash, the system can retrieve
all the checkpoint records that belong to unprepared actions from stable memory.
Recall that the checkpoints created by a program are ordered in their execution
order and that sub-action creation and preparation can be regarded as special
checkpoints. Only programs that were executed by unprepared actions need to be
restarted. Programs that had returned before an ancestor program executed a
checkpoint need not be restarted either.

make-appointment = procedure(...)
    coenter
        remote-mark-subaction(...)
        timer-subaction(...)
    end except when unavailable:
            checkpoint(...)
            resume              % subactions
            continue at contm
        end
end make-appointment

remote-mark-subaction = procedure(...)
    ...                         % invoke remote subaction
    checkpoint(...)
    continue at contr
end remote-mark-subaction

timer-subaction = procedure(...)
    ...                         % calculate deadline and wait for
    ...                         % short time before checkpoint
    checkpoint(deadline)
    continue at contt
end timer-subaction

contm = procedure(...)
    if expired signalled
        then ...                % abandon
    end
end contm

contr = procedure(...)
    ...                         % examine result of
    ...                         % remote subaction
    abort sibling and return
end contr

contt = procedure(t: time)
    sleep-until(t)
    abort sibling and
    signal expired
end contt

Figure 7-4: Using Parallel Sub-Actions to Specify Application Time-Out

A program can be top-level if it executes the top-level action, in which case the state
of its caller, if the program has any, is irrelevant for recovery purposes. For the
non-top-level programs that potentially need to be restarted, the frame identifiers
recorded during a checkpoint can be used to identify their callers. A caller can be in
a remote site and not necessarily checkpointed. If the caller is local, one of the
checkpoints of the caller should be associated with a procedure call and expecting
this program to return.

To determine whether the caller of a program has a checkpoint at the call or is still
waiting for the call to return, a message has to be sent to a remote site if the caller is
executing remotely. If a caller neither has a checkpoint at the call nor is it waiting for
the call to return, the callee should be asked to abort. If the caller is still waiting for
the call to return, no more work needs to be done and the callee can restart. If the
caller is not waiting for the call to return but has a checkpoint at the call, the caller
can continue up the chain and determine whether the caller itself can restart at that
checkpoint. If the caller can restart at that checkpoint, the callee can restart also.
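The decision procedure in the preceding paragraph can be sketched as a recursive
check up the call chain. The sketch assumes that the answers to the inquiry
messages are already available in a local table; the class ProgramStatus, its fields,
and the function can_restart are hypothetical names introduced only for this
illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProgramStatus:
    caller: Optional[str]              # frame id of the caller; None for a top-level program
    caller_waiting: bool = False       # caller is still waiting for this call to return
    caller_checkpointed: bool = False  # caller has a checkpoint associated with this call


def can_restart(programs: Dict[str, ProgramStatus], frame_id: str) -> bool:
    """Decide whether a checkpointed program may restart after a failure."""
    status = programs[frame_id]
    if status.caller is None:
        return True   # top-level: the caller's state is irrelevant
    if status.caller_waiting:
        return True   # the caller still expects the call to return
    if status.caller_checkpointed:
        # The caller is not waiting but can come back to the call:
        # the callee can restart only if the caller itself can.
        return can_restart(programs, status.caller)
    return False      # neither waiting nor checkpointed: callee should abort


programs = {
    "top": ProgramStatus(caller=None),
    "mid": ProgramStatus(caller="top", caller_checkpointed=True),
    "leaf": ProgramStatus(caller="mid", caller_checkpointed=True),
    "lost": ProgramStatus(caller="top"),
}
print(can_restart(programs, "leaf"))  # True
print(can_restart(programs, "lost"))  # False
```

In a real system each lookup in the table may correspond to a message to a remote
site, which is why the optimizations discussed next matter.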

7.1.2.2 Restarting a Program

In order to restart a program as quickly as possible, two optimizations can be
introduced. First, the sending of an inquiry message to a remote caller and a restart
can proceed in parallel. This is crucial as there may be a long delay before an
answer is returned. Second, a call message that invokes a remote callee can
indicate whether the caller is checkpointed at the call. If it is, no inquiry messages
are needed later. Also, positive replies to an inquiry message can be saved and later
inquiry messages directed to the same caller can be omitted.

When a program is restarted at a checkpoint, all the work performed after the
checkpoint, including any changes to the local objects and any sub-action created,
should be undone or aborted. The values of the local objects are restored according
to the values recorded by the checkpoint. See [54, 35] for a discussion of detecting
orphan sub-actions that are still running even when they are supposed to be aborted.
To avoid committing supposedly aborted sub-actions, the return, prepare, and
commit computation messages should contain the tree of action identifiers that
ought to be committed. An action should refuse preparation if the action tree
contains sub-actions that should have been aborted.
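The check on the action tree can be sketched as follows. We assume, for illustration
only, that a tree is a nested (identifier, children) pair and that the receiving site
already holds a set of action identifiers it knows to have been aborted; how that
set is obtained is outside the sketch, and the name vote_on_prepare is hypothetical.

```python
def vote_on_prepare(action_tree, known_aborted):
    """Vote "yes" only if no action in the received tree is known to be aborted."""

    def ids(node):
        action_id, children = node
        yield action_id
        for child in children:
            yield from ids(child)

    if any(action_id in known_aborted for action_id in ids(action_tree)):
        return "no"
    return "yes"


# Tree carried by a prepare message: T has sub-actions T.1 and T.2, and
# T.2 has a sub-action T.2.1.
tree = ("T", [("T.1", []), ("T.2", [("T.2.1", [])])])
print(vote_on_prepare(tree, known_aborted={"T.3"}))    # yes
print(vote_on_prepare(tree, known_aborted={"T.2.1"}))  # no
```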

To restart a program on a crashed site, the continuation procedure associated with
the checkpoint can be invoked directly if the checkpoint is not associated with a
procedure call. Otherwise, the program can re-invoke its callee.

7.1.2.3 Other Types of Failures

Dealing with other types of failure is similar. If an operation a is the victim of a
deadlock, or a has made an invalid assumption in an optimistic concurrency control
algorithm, the checkpoint before a can be considered as the "last" checkpoint
before a "crash" (see figure 7-5). All work performed after the "last" checkpoint has
to be undone. Determining this checkpoint requires remembering the ordering of the
checkpoints and the points at which operations occur. If this is too expensive, the
beginning of the action that a is executed in can be used as the last checkpoint.

7.2 Message Transfer Agents 

In the introduction, we described a communication problem due to the improbability
of having all the components along a communication path operational at the same
time. This section describes how to alleviate the problem with Message Transfer
Agents (MTA's).



Figure 7-5: [...] to Deadlocks or Invalid Dependency Assumptions



An MTA provides buffering of messages [...] of a message cannot
communicate with [...] directly. A [...] relaying through several
MTA's before reaching its final destination, [...] through several [...]
is more likely to succeed than requiring [...] in the path to be
operational simultaneously. [...] that [...] would get the message
"closer" to its destination. [...] in trying to send [...] We
assume [...] of the [...] routing algorithm to select relay MTA's given
the MTA closest [...]



If a destination resource manager has a fixed network address, the system can 
determine which MTA is "closest" simply by some table lookup. However, a resource 
manager can occasionally be relocated from one address to another. For example, a 
resource manager can be reincarnated in a different machine when a previous one 
crashes, and portable computers can be carried around and reconnected to the 
network at different locations. If the new address has not been propagated in the 
system, the table lookup may not return the closest MTA. 

This problem can be alleviated in two ways. First, the source and destination of a
message can be expressed in resource manager identifiers, instead of network
addresses. Each relaying MTA can perform a table lookup for the best MTA to send
to. Another possible solution is to allow each resource manager to specify a set of
MTA's as its home MTA's. For example, a user may specify MTA's which are closest
to his home or office as the home MTA's for his portable calendar resource manager.
Messages can be replicated and sent to each of these home MTA's. Although extra
resources are required for replication, these replicated messages are otherwise
harmless because they are detected by the destination resource manager. A home
MTA that receives a message will try to send the message to the destination resource
manager periodically. A resource manager can also poll its home MTA's periodically
or when it is conscious of its being reconnected to the network.

To avoid keeping messages in an MTA for an extended period of time and employing
complicated algorithms to inform an MTA when messages can be deleted, an MTA
assumes that it can delete a message when its delivery has been acknowledged by
the destination resource manager or the next MTA on the path. If the delivery is not
acknowledged (e.g., the acknowledgment message is lost), the MTA can try another
path without having to worry about a possible replicated message, which is harmless.
In fact, a message can be replicated intentionally and relayed through different
routes to increase reliability and minimize delay even when there is only one home
MTA. To avoid lost messages, messages can be stored in stable memory along the
route. To avoid an MTA being "stuck" with a message, each message is associated
with an expiration time and the message can be dropped when it expires. The sender
of a message is responsible for resending when the message expires.
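The rules above (delete on acknowledgment, drop on expiration, tolerate duplicates)
can be sketched in a few lines. The classes MTA and Receiver, the message
identifiers, and the integer clock are assumptions made for this illustration; stable
memory and routing are not modelled.

```python
class MTA:
    """Toy message transfer agent: buffer, forward, delete on ack or expiry."""

    def __init__(self):
        self.buffer = {}  # message id -> (payload, expiration time)

    def accept(self, msg_id, payload, expires_at):
        self.buffer[msg_id] = (payload, expires_at)

    def forward(self, now, deliver):
        """Try to pass each buffered message on.

        deliver(msg_id, payload) returns True when the next hop acknowledges
        receipt. Acknowledged and expired messages are dropped; messages that
        are not acknowledged stay buffered for a later attempt.
        """
        for msg_id, (payload, expires_at) in list(self.buffer.items()):
            if now >= expires_at or deliver(msg_id, payload):
                del self.buffer[msg_id]


class Receiver:
    """Destination that detects and discards duplicate messages."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, msg_id, payload):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.delivered.append(payload)
        return True  # acknowledge duplicates too, so the MTA can delete them


receiver = Receiver()
home_a, home_b = MTA(), MTA()
for mta in (home_a, home_b):  # replicate the message to both home MTA's
    mta.accept("m1", "meet at 10", expires_at=100)
home_b.accept("m2", "stale", expires_at=5)

home_a.forward(now=10, deliver=receiver.receive)
home_b.forward(now=10, deliver=receiver.receive)
print(receiver.delivered)            # ['meet at 10']
print(home_a.buffer, home_b.buffer)  # {} {}
```

The replica held by the second home MTA is acknowledged and discarded at the
receiver, and the expired message is dropped without any coordination between
the two MTA's.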

Several other protocols [47, 22, 57] provide a similar relaying service. A Simple Mail
Transfer Protocol which provides a relaying service across transport service
environments for mail is described in [47]. Sites that are connected to different
transport services are chosen as relaying points. An asynchronous data distribution
service for general distributed applications for the SNA architecture is described in
[22]. A similar service for the CCITT standard is described in [57].

The protocol we described above is not meant to be a complete specification but
rather an outline of the main features. One of the features in our protocol is our
assumption that a recipient can detect and discard duplicate messages. It allows us
to simplify our protocol and increase reliability by replicating messages. Also, an
MTA can discard messages when they expire. It allows the resources of an MTA to
be reclaimed easily.

7.3 Conclusion 

This chapter described the resilience problems that a computation may encounter
when partitions in the network are frequent. In addition to the increased possibility of
site crashes during the long execution of a computation, there is also a higher
likelihood of deadlocks. To avoid a computation being aborted whenever a failure
occurs, a program can execute checkpoints from which it can be restarted. We have
described how the state of the program can be specified at these checkpoints. In
view of the possible long communication delay between two sites, we have shown
how their checkpoints can be coordinated. A program can execute a checkpoint in
anticipation of or in response to a long delay in communication. A program can also
inform its caller when it is performing a checkpoint.

A different resilience problem arises when it is unlikely for the sender and receiver of
a message to communicate synchronously. We described a relaying service which
has a simple protocol due to its assumption that duplicate messages can be detected
by the receiver.






Chapter Eight 
Conclusion 



This chapter summarizes our work and suggests future work. 

8.1 Summary 

As the size and complexity of a system grow, it becomes more difficult to understand
the behavior of the system. Atomicity provides a useful tool to handle this problem.
In this dissertation we have investigated how long atomic computations can be
supported.

There are several questions that we tried to answer: 

1. How to improve the concurrency of a system with long atomic
computations?

2. Given that answers to the previous question may require
application-dependent synchronization and recovery, how can the process of
implementing an application be simplified?

3. Is atomicity the right model for long computations after all?

4. How can a long computation be resilient to transient failures?

Two solutions to the concurrency problem have been proposed in this thesis. The
first solution involves the use of application semantics, which is not a new idea. The
basis of the solution is to define atomicity using the serial specifications of abstract
objects, which are specifications of the abstract objects' behavior in an environment
without concurrency or failures. As long as the external behavior of an abstract
object appears to be atomic, how the object masks the internal concurrency and
failures is immaterial. This approach of defining atomicity naturally leads to a
trade-off between functionality and concurrency. By relaxing serial specifications,
concurrency is increased. Being able to trade off functionality for concurrency is an
important requirement in a system with long computations. Given that an
implementation cannot predict whether tentative computations will commit and that
computations can be initiated asynchronously and interleave, a concurrency
problem is unavoidable unless a "weak" functionality is used.

The ability to define atomicity based on objects' serial specifications also makes
atomicity at least as powerful as other correctness definitions that abandon
atomicity. We have shown that given a consistent system [50], an equivalent atomic
system can be defined such that the set of atomic histories is identical to the set of
equivalent consistent histories. We have also argued that in many cases, the serial
specifications in the equivalent atomic system are identical to the specifications used
in the consistent system. Consequently, atomicity is at least as powerful and easier
to understand. This result assures us that our atomicity definition is a useful tool.

In implementing an application, an application programmer is confronted with two
problems. First, how can the serial specification of an object be defined such that
there is "enough" concurrency? Second, how can abstract objects that behave
atomically be implemented? We introduced a conflict model that measures the level
of concurrency by how frequently conflicts are created. We have described a
process with which a programmer can derive conflict conditions from the serial
specification of an object. Since a conflict condition is a useful indication of the level
of concurrency in an implementation, the serial specification of the object can be
designed accordingly. An important characteristic of the conflict model is the
masking of the underlying concurrency control algorithm. Hence, the designer of a
serial specification does not have to be knowledgeable or aware of details of the
underlying concurrency control algorithm.

The implementation paradigm that we suggested for the implementation of an atomic
object follows the conflict model closely. When an operation is invoked, it first tests
whether a conflict is created. If a conflict is created, it must be resolved. Otherwise
the operation can proceed. We emphasize simplicity in our implementation
paradigm. Not only do programs become easier to write, their correctness can also
be argued more easily. History objects are used to capture the necessary
information that determines whether a conflict is created. We described two
recovery paradigms that govern how recovery is achieved.
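
The test-then-proceed shape of this paradigm can be sketched as follows. This is
an illustration under stated assumptions, not the interface defined earlier in
the thesis: the class HistoryObject, the functions invoke and account_conflicts,
and the conflict condition chosen for a bank account (deposits commute, any pair
involving a withdrawal conflicts) are all hypothetical. Conflict resolution is
left abstract.

```python
class HistoryObject:
    """Hypothetical history object: remembers operations of active computations."""

    def __init__(self, conflicts):
        self.conflicts = conflicts   # predicate (new_op, earlier_op) -> bool
        self.tentative = []          # (computation_id, operation) pairs

    def creates_conflict(self, comp, op):
        """True if `op` conflicts with an operation of another active computation."""
        return any(other != comp and self.conflicts(op, earlier)
                   for other, earlier in self.tentative)

    def record(self, comp, op):
        self.tentative.append((comp, op))

    def finish(self, comp):
        """Forget a computation's operations once it commits or aborts."""
        self.tentative = [(c, o) for c, o in self.tentative if c != comp]


def invoke(history, comp, op):
    """Test for a conflict first; proceed only if none is created."""
    if history.creates_conflict(comp, op):
        return "conflict"   # resolution (for example, waiting) is not modelled
    history.record(comp, op)
    return "proceed"


def account_conflicts(new_op, earlier_op):
    """Assumed conflict condition: only pairs of deposits commute."""
    return "withdraw" in (new_op, earlier_op)
```

The operation code never inspects locks or timestamps; it consults only the
history object, which is the sense in which the underlying concurrency control
algorithm stays hidden from the application programmer.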

An important feature of a history object is that, similar to the conflict model, it masks
the underlying concurrency control algorithm from the application programmer. An
application programmer can write programs without having to know the underlying
concurrency control algorithm and its details. The programs written can also be
ported to systems with different concurrency control algorithms. This portability is
important when systems with different algorithms may be merged. It is also helpful
when little actual experience is available to determine the optimal concurrency
control algorithm. We have shown how the programming interface can be
implemented with different concurrency control algorithms.

Another implementation mechanism suggested is the concept of local atomicity
versus global atomicity. By executing (short) portions of a globally atomic
computation as locally atomic computations, the programming of
application-dependent synchronization and recovery is simplified. A parallel with
recursion can be drawn. The implementation of long atomic computations is
simplified by making portions of them atomic to one another. The power of the
atomicity concept is reused at a different level.

The motivation for these implementation mechanisms is to provide a stylized and
well-understood way of implementing atomic objects. By using the history objects to
derive conflict conditions, the recovery paradigms to perform recovery, and local
atomicity to decompose synchronization and recovery, globally atomic objects can
be implemented easily.

The second solution that we provide to the concurrency problem is a limited one. We
have designed two novel concurrency control algorithms that minimize the
occurrences of costly conflicts. These algorithms provide a limited solution because
they are effective only under special conditions. For example, for the hierarchical
algorithm, costly restarts and long delays can be avoided if distributed computations
and computations that both observe and mutate are rare.

Finally, we have discussed a checkpointing mechanism and a reliable message
delivery service that alleviate some of the resilience problems. In view of the possibly
long delay to communicate between two sites, we have shown how the checkpoints
within a computation can be coordinated. A program invoking another possibly
remote program can execute a checkpoint in anticipation of, or in response to, a long
delay in communication. It can also inform its own caller so that its caller can in turn
prepare for the delay. Due to the possibly long communication delay and cost in
accessing stable memory, the checkpointing process proceeds asynchronously at
each site.
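
The two occasions for taking a checkpoint can be made concrete with a small
simulation. It is a sketch only: the function remote_call, its parameters, and
the rule of comparing delays against a single timeout are assumptions made here,
and the asynchronous writing of checkpoints to stable memory is not modelled.

```python
def remote_call(expected_delay, actual_delay, timeout):
    """Simulate one remote invocation and return the events in order.

    Hypothetical policy:
      - checkpoint in anticipation when the expected delay exceeds the timeout;
      - otherwise checkpoint in response, once the timeout expires with no reply;
      - in both cases inform the caller so that it can prepare as well.
    """
    events = []
    checkpointed = False
    if expected_delay > timeout:
        events += ["checkpoint (anticipation)", "inform caller"]
        checkpointed = True
    events.append("send call")
    if actual_delay > timeout and not checkpointed:
        events += ["checkpoint (response)", "inform caller"]
    events.append("receive reply")
    return events
```

A call that is expected to be fast but turns out slow, such as
remote_call(1, 10, 5), checkpoints only after the call has been sent, whereas
remote_call(10, 10, 5) checkpoints before sending.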

8.2 Future Work 

In this section we will discuss a number of areas for further investigation.

8.2.1 Other Communication Primitives 

In this thesis, we have chosen RPC as the communication primitive. Although it has
its limitations, such as in dealing with interactions that resemble coroutines, RPC is
relatively well understood and familiar to programmers. The tree of calls and returns
also fits nicely with the nested action tree. However, the requirement that each call
must be paired with a return may pose some efficiency problems in an environment
with long communication delays. It is not uncommon to have computations
consisting of work that needs to be done sequentially at several (more than two) sites.
The arrangement that requires the shortest communication delay will have the first
site invoke the second, the second invoke the third, and so on, until the last returns to
the first. This is not possible within the RPC paradigm.
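
A rough count of message hops shows the size of the gap. The sketch assumes
that every hop between two sites costs the same delay, which is a simplification
made here and not a claim of the thesis; the function names are hypothetical.

```python
def nested_rpc_hops(n_sites):
    """Hops when site 1 calls site 2, site 2 calls site 3, and so on,
    with every call paired with a return back along the same chain."""
    return 2 * (n_sites - 1)


def ring_hops(n_sites):
    """Hops when each site forwards the work to the next site and the
    last site replies directly to the first."""
    return n_sites
```

With two sites both arrangements take 2 hops, which is why the problem only
appears with more than two sites. With four sites the paired calls and returns
take 6 hops against 4 for the ring. Having the first site call each of the
others in turn does not help, since that also costs 2(n - 1) hops.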

Another type of communication primitive that has been proposed is broadcast
messages [12]. Communication cost can be reduced when implementing, say,
replicated objects. In particular, the messages that need to be relayed through the
MTA's described in Chapter 7 can be minimized.

Incorporating new communication primitives requires much rethinking of the design
and implementation of a system. For instance, it is unclear how a nested action tree
can be defined when the control structure of the computation does not follow a
nested tree of invokes and returns.

There is also the problem of language design. A simple semantics of the
communication primitives should be presented to the programmers. When the
communication primitives are implemented on an unreliable network, the
implementation should be efficient and yet conform to the semantics.

8.2.2 Hardware Configuration and Reliability 

We have assumed in this thesis that each site is equipped with stable memory. This
is not necessarily true for most personal workstations. One solution is to provide
stable memory servers shared by the sites without stable memory. The protocol
between the sites and the stable memory servers must not only be efficient, but also
provide a reliable service seldom interrupted by site crashes. For example, if the site
on which a resource manager resides crashes, one should be able to reincarnate the
resource manager on a different site with the help of the stable memory server,
without waiting for the original site to be recovered. By concentrating the stable
memory of a system into fewer stable memory servers, better maintenance can be
provided to these machines and the system becomes more reliable as a result.

A more difficult requirement is for a resource manager to be able to continue its
service using another stable memory server when the original server crashes, with or
without aborting ongoing computations. The problem is difficult as the resource
manager may not have a copy of its entire state. A less ambitious goal is to provide
some limited service, such as only allowing prepared actions to be committed. Since
prepared actions have their changes written in the crashed stable memory server, the
new stable memory server can record the commitment and retrieve the changes from
the crashed server later.

Instead of stable memory servers, a system can replicate the state of a resource
manager on multiple sites. If these sites have relatively independent failure
characteristics, the storage reliability may be as high as that provided by stable
memory. Similar to the stable memory servers described above, the replicated state
information can also be used to increase availability when the resource manager
crashes.

A natural extension of this scheme is to replicate not only the state that needs to be
stored in stable memory, but also that in volatile memory. Long computations
interrupted by site crashes are not aborted and they can resume execution as soon
as one of the "backup" sites where the state information is replicated is chosen as
the "primary." Obviously, a resource manager cannot afford to broadcast every
memory update to its backups. A checkpointing scheme not unlike the one
described in Chapter 7 can be used to coordinate the updates at the backups.

8.2.3 Replication 

A different form of replication can be used to reduce communication delay and
increase availability of the system. The replication in the previous section can be
regarded as the replication of system-level objects. Replication can also be
implemented at the application level. Conceivably, an application-level object can be
replicated in several sites with different representations.

Replicating at the application level has the advantage that the semantics of the
application can be utilized to reduce the number of replicas that have to be
accessed. Herlihy [20] discusses using the type of an operation to determine the
quorums of replicas that need to be accessed. Different kinds of semantic information
can be used. For example, the state of an airline reservation database can be
replicated in several sites. Each site can sell tickets and update its own replica.
The updates can be propagated to other sites after they are committed. The number
of tickets sold can be kept under a ceiling as long as each site is limited to sell only a
portion of the total tickets left. Periodically the number of tickets left can be
recalculated.
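
The airline example can be sketched as follows. The class Site, the function
recalculate, and the even split of the remaining tickets among sites are
assumptions introduced for illustration; the thesis only requires that each site
be limited to a portion of the tickets left and that the count be recalculated
periodically.

```python
class Site:
    """Hypothetical replica site that sells from a local allotment."""

    def __init__(self, name):
        self.name = name
        self.quota = 0   # tickets this site may still sell on its own
        self.sold = 0    # tickets sold since the last recalculation

    def sell(self, n=1):
        """Sell n tickets without contacting other sites, if the quota allows."""
        if n > self.quota:
            return False
        self.quota -= n
        self.sold += n
        return True


def recalculate(sites, tickets_left):
    """Periodic step: subtract reported sales, then re-split what remains."""
    tickets_left -= sum(s.sold for s in sites)
    share, extra = divmod(tickets_left, len(sites))
    for i, s in enumerate(sites):
        s.sold = 0
        s.quota = share + (1 if i < extra else 0)
    return tickets_left
```

Because the quotas handed out never add up to more than the tickets left, no
combination of local sales can exceed the ceiling, and sites need to
communicate only at recalculation time.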

The implementation of replicated objects in our system would present an interesting
(but not mutually exclusive) alternative to the solutions we have proposed for long
computations. Long communication delays can be avoided if only nearby replicas
are accessed. Implementing the replicated objects with the programming paradigms
and mechanisms proposed in this thesis would be an interesting test for these ideas.

8.2.4 Implementation Experience

Because the ideas proposed in this thesis have not been implemented, many of the 
system issues are not discussed. There is no doubt that much fine tuning of the
system is needed to produce a practical implementation. For example, the scheduler 
of the system has to be "fair" and efficient, since there may be many pending 
processes waiting to be scheduled, some of them having been delayed for a long 
time. 

Another critical component of the system is the stable memory manager. In many of
our arguments, we have relied on the piggybacking of stable memory accesses to
make the costs of our algorithms acceptable. Careful coding is required. If the
stable memory manager is implemented with a remote stable memory server, the
system performance becomes even more sensitive to the frequency of stable memory
accesses.


The implementation of the communication subsystem is also left unspecified in this
thesis. In particular, the timeout interval is an important parameter. Too short an
interval leads to wasted effort in checkpointing. Too long an interval may jeopardize
an uncheckpointed computation and delay the application from taking other
appropriate actions, such as informing the user.






Finally, the usefulness of the ideas proposed in this thesis cannot be fully tested
unless some applications are implemented. For example, the use of the serial
specifications to specify the behavior of "large" applications would help us evaluate
the practicality of our correctness definition. An implementation would also provide
useful data in determining the merits of the detailed concurrency control algorithms,
recovery paradigms, and other implementation mechanisms described in this thesis.

8.3 Conclusion

Atomicity is a powerful concept that masks concurrency and failures in a distributed
system. Long computations are called for in some applications due to long delays in
communication or other types of I/O. The proposals in this thesis provide
solutions to the concurrency or resilience problems that a system with long atomic
computations may encounter.






References 



[1]  J. E. Allchin.
     An Architecture for Reliable Decentralized Systems.
     PhD thesis, Georgia Institute of Technology, September, 1983.

[2]  J. E. Allchin and M. S. McKendry.
     Synchronization and Recovery of Actions.
     In Proceedings of ACM Second Annual Symposium on Principles of
     Distributed Computing, pages 31-44. ACM, 1983.

[3]  J. F. Bartlett.
     A NonStop Kernel.
     In Proceedings of the Eighth Symposium on Operating Systems Principles,
     pages 22-29. ACM, 1981.

[4]  R. Bayer et al.
     Dynamic Timestamp Allocation for Transactions in Distributed Systems.
     North-Holland, 1982, pages 9-20.

[5]  C. Beeri et al.
     A Concurrency Control Theory for Nested Transactions.
     In Proceedings of ACM Second Annual Symposium on Principles of
     Distributed Computing, pages 46-62. ACM, 1983.

[6]  P. A. Bernstein, D. W. Shipman, and W. S. Wong.
     Formal Aspects of Serializability in Database Concurrency Control.
     IEEE Transactions on Software Engineering SE-5(3):203-215, May, 1979.

[7]  P. A. Bernstein and N. Goodman.
     Concurrency Control in Distributed Database Systems.
     Computing Surveys 13(2):185-221, June, 1981.



[8]  P. A. Bernstein.
     Two Part Proof Schema for Database Concurrency Control.
     In Proceedings of the Fifth Berkeley Workshop on Distributed Data
     Management and Computer Networks, pages 71-84. IEEE, February, 1981.

[9]  P. A. Bernstein and N. Goodman.
     Multiversion Concurrency Control - Theory and Algorithms.
     ACM Transactions on Database Systems 8(4):465-483, December, 1983.

[10] K. P. Birman.
     Replication and Fault-Tolerance in the ISIS System.
     In Proceedings of the Tenth Symposium on Operating Systems Principles,
     pages 79-86. ACM, 1985.

[11] A. D. Birrell, R. Levin, R. M. Needham, and M. D. Schroeder.
     Grapevine: An Exercise in Distributed Computing.
     Communications of ACM 25(4):260-274, April, 1982.

[12] D. R. Boggs.
     Internet Broadcasting.
     PhD thesis, Stanford University, October, 1983.

[13] A. Borg, J. Baumbach, and S. Glazer.
     A Message System Supporting Fault Tolerance.
     In Proceedings of the Ninth Symposium on Operating Systems Principles,
     pages 90-99. ACM, 1983.

[14] K. P. Eswaran et al.
     The Notions of Consistency and Predicate Locks in a Database System.
     Communications of ACM 19(11):624-633, November, 1976.

[15] D. K. Gifford and J. Donahue.
     Coordinating Independent Atomic Actions.
     In Proceedings of COMPCON. IEEE, 1985.



[16] D. K. Gifford.
     Weighted Voting for Replicated Data.
     In Proceedings of the Seventh Symposium on Operating Systems Principles.
     ACM SIGOPS, December, 1979.

[17] J. N. Gray.
     Notes on Data Base Operating Systems.
     In Operating Systems: An Advanced Course, Lecture Notes in Computer
     Science Vol. 60, pages 393-481. Springer-Verlag, 1978.

[18] J. N. Gray et al.
     The Recovery Manager of the System R Database Manager.
     Computing Surveys 13(2):222-242, June, 1981.

[19] M. Hammer and D. Shipman.
     Reliability Mechanisms for SDD-1: A System for Distributed Databases.
     ACM Transactions on Database Systems 5(4):431-466, December, 1980.

[20] M. P. Herlihy.
     Replication Methods for Abstract Data Types.
     PhD thesis, Massachusetts Institute of Technology, May, 1984.

[21] C. A. R. Hoare.
     Monitors: An Operating System Structuring Concept.
     Communications of ACM 17(10):549-557, October, 1974.

[22] B. C. Housel and C. J. Scopinich.
     SNA Distribution Service.
     IBM Systems Journal 22(4):319-343, 1983.

[23] Z. Kedem and A. Silberschatz.
     Non-Two-Phase Locking Protocols with Shared and Exclusive Locks.
     In Proceedings of Sixth International Conference on Very Large Data Bases,
     pages 309-317. ACM, 1980.



[24] H. F. Korth.
     A Deadlock-Free, Variable Granularity Locking Protocol.
     In Proceedings of the Fifth Berkeley Workshop on Distributed Data
     Management and Computer Networks, pages 105-121. IEEE, February, 1981.

[25] H. F. Korth.
     Locking Primitives in a Database System.
     Journal of the ACM 30(1):55-79, January, 1983.

[26] H. T. Kung and J. T. Robinson.
     On Optimistic Methods for Concurrency Control.
     ACM Transactions on Database Systems 6(2):213-226, June, 1981.

[27] L. Lamport.
     Time, Clocks, and the Ordering of Events in a Distributed System.
     Communications of ACM 21(7):558-565, July, 1978.

[28] B. Lampson.
     Atomic Transactions.
     In Distributed Systems: Architecture and Implementation, Lecture Notes in
     Computer Science Vol. 100, chapter 11. Springer-Verlag, 1980.

[29] B. H. Liskov and A. Snyder.
     Exception Handling in CLU.
     IEEE Transactions on Software Engineering SE-5(6):546-558, November,
     1979.

[30] B. H. Liskov and R. Scheifler.
     Guardians and Actions: Linguistic Support for Robust, Distributed Programs.
     ACM Transactions on Programming Languages and Systems 5(3):381-404,
     July, 1983.

[31] B. H. Liskov.
     The Argus Language and System.
     In Goos and Hartmanis, editors, Distributed Systems: Methods and Tools for
     Specification; An Advanced Course, Lecture Notes in Computer Science
     Vol. 190, pages 343-430. Springer-Verlag, Berlin, 1985.



[32] B. H. Liskov and J. Guttag.
     Abstraction and Specification in Program Development.
     MIT Press, 1986.

[33] B. H. Liskov and W. Weihl.
     Specifications of Distributed Programs.
     Distributed Computing 1(2), April, 1986.

[34] N. A. Lynch.
     Concurrency Control for Resilient Nested Transactions.
     In Proceedings of ACM SIGACT-SIGMOD Symposium on Principles of
     Database Systems, pages 166-181. ACM, 1983.

[35] M. S. McKendry and M. Herlihy.
     Time-Driven Orphan Elimination.
     In Proceedings of the Fifth Symposium on Reliability in Distributed Software
     and Database Systems. IEEE, 1986.

[36] Mesa Language Manual Version 5.0.
     1979.

[37] C. Mohan and B. Lindsay.
     Efficient Commit Protocols for the Tree of Processes Model of Distributed
     Transactions.
     Operating Systems Review 19(2):40-52, April, 1985.

[38] H. Garcia-Molina.
     Using Semantic Knowledge for Transaction Processing in a Distributed
     Database.
     ACM Transactions on Database Systems 8(2):186-213, June, 1983.

[39] W. A. Montgomery.
     Robust Concurrency Control for a Distributed Information System.
     PhD thesis, Massachusetts Institute of Technology, December, 1978.



[40] J. E. Moss.
     Nested Transactions: An Approach to Reliable Distributed Computing.
     Technical Report TR-260, MIT Laboratory for Computer Science, 1981.

[41] J. E. Moss.
     Checkpoint and Restart in Distributed Transaction Systems.
     In Proceedings of the Third Symposium on Reliability in Distributed Software
     and Database Systems, pages 85-89. IEEE, 1983.

[42] J. E. Moss, et al.
     Abstraction in Concurrency Control and Recovery Management.
     Technical Report 86-20, COINS, University of Massachusetts, Amherst, 1986.

[43] R. Obermarck.
     Global Deadlock Detection Algorithm.
     ACM Transactions on Database Systems 7(2):187-208, June, 1982.

[44] B. M. Oki.
     Reliable Object Storage to Support Atomic Actions.
     Technical Report TR-308, MIT Laboratory for Computer Science, 1983.

[45] D. C. Oppen and Y. K. Dalal.
     The Clearinghouse: A Decentralized Agent for Locating Named Objects in a
     Distributed Environment.
     ACM Transactions on Office Information Systems 1(3):230-253, July, 1983.

[46] C. H. Papadimitriou.
     The Serializability of Concurrent Database Updates.
     Journal of the ACM 26(4):631-653, October, 1979.

[47] J. B. Postel.
     Simple Mail Transfer Protocol.
     Technical Report RFC 821, ISI, University of Southern California, 1982.

[48] D. P. Reed.
     Naming and Synchronization in a Decentralized Computer System.
     PhD thesis, Massachusetts Institute of Technology, 1978.






[49] F. B. Schneider.
     Byzantine Generals in Action: Implementing Fail-Stop Processors.
     ACM Transactions on Computer Systems 2(2):145-154, May, 1984.

[50] P. M. Schwarz and A. Z. Spector.
     Synchronizing Shared Abstract Types.
     ACM Transactions on Computer Systems 2(3):223-250, August, 1984.

[51] L. Sha.
     Modular Concurrency Control and Failure Recovery - Consistency,
     Correctness and Optimality.
     PhD thesis, Carnegie-Mellon University, March, 1985.

[52]-[57]
     The remaining entries are only partly legible in the source scan, and
     their numbering cannot be determined. The recoverable fragments are:

     D. [illegible]

     Resilient Distributed Computing.
     IEEE Transactions on Software Engineering SE-10(3), May, 1984.

     R*: An Overview of the Architecture.
     In Proceedings of [illegible], 1984.



REPORT DOCUMENTATION PAGE

Security classification of this page: Unclassified

1. Report number: MIT/LCS/TR-377

4. Title: Long Atomic Computations

7. Author: Pui Ng

8. Contract or grant number: Office of Naval Research Contract N00014-83-K-0125

9. Performing organization name and address:
   MIT Laboratory for Computer Science
   545 Technology Square
   Cambridge, MA 02139

11. Controlling office name and address:
   DARPA/DOD
   1400 Wilson Blvd.
   Arlington, VA 22217

12. Report date: September, 1986

13. Number of pages: 221

15. Security class (of this report): Unclassified

16. Distribution statement (of this report): Approved for Public Release;
   distribution is unlimited

17. Distribution statement (of the abstract entered in Block 20, if different
   from Report): unlimited

19. Key words: distributed systems, atomicity, concurrency control, long
   computations, recovery, fault tolerance, reliability, programming methodology.

20. Abstract:



Distributed computing systems are becoming commonplace and offer interesting
opportunities for new applications. In a practical system, the problems of
synchronizing concurrent computations and recovering from failures must be dealt
with effectively. Atomicity has been suggested as a tool that masks concurrency
and failures from the users of a system. With synchronization and recovery
mechanisms, atomic computations appear to execute indivisibly. This dissertation
addresses the issues in implementing long atomic computations, such as
computations that last for hours or even days. Long computations make
synchronization more difficult because their execution is more overlapped. They
are also more likely to encounter failures in their execution. Three issues are
raised:

1. Should long computations be executed atomically? Or should atomicity be
replaced with other correctness criteria to increase the concurrency of a system?

2. If long atomic computations can be implemented practically, are there
implementation paradigms that application programmers can follow to simplify
the programming effort?

3. How can long atomic computations be made resilient to transient failures?

This dissertation shows that by using the semantics of an application, a system
that supports atomic computations can be made as concurrent as other systems
that do not. Since atomicity is easier to understand than other correctness
criteria, systems that support long atomic computations are preferable.

Using the semantics of an application requires application-dependent
synchronization and recovery code, which can be complicated and introduce subtle
errors easily. Several synchronization and recovery paradigms are investigated
in this dissertation. The paradigms divide synchronization and recovery into
levels so that the task at each level is simpler. A programming interface that
hides the concurrency control algorithm used by a system implementation is also
presented.

Finally, this dissertation discusses the use of checkpoints and buffered
messages to increase the resilience of long atomic computations.






